Purpose and Objectives

This analysis examines expressions of “loneliness” by inspecting declarations of loneliness on the social media platform Twitter over a 6-week period between June and August 2020. Its purpose is to uncover and document actionable insights contained in the available source data for the benefit of the target audience.

The target audience includes Barwon Health, the Committee for Geelong and Loneliness In Geelong project team members.

The objective is to explore the extracted Twitter data and identify:

  • Tweets that mention “lonely” or “loneliness” and were created by users located within 50 km of the Geelong CBD
  • The meaningful context in which loneliness is communicated by Twitter users
  • The emojis used in association with these Tweets and how these can describe the emotional sentiment of the text
  • The sentiment of the Tweet text
  • Geographical “hotspots” and “gaps” in where people are communicating about loneliness
  • Natural groupings and patterns emerging within the data
  • Basic descriptive statistics across the entire data set
  • Further opportunities and inspiration for data analysis to support the investigation of loneliness issues in the Geelong region

Important note: This analysis is a not-for-profit independent analysis conducted by Bree McLennan, using data extracted via the Twitter API for the specific objective as described above. This analysis does not represent the opinions of the Committee for Geelong or Barwon Health.

 

Project workflow

This project is written in a programming language known as R.

The project work-flow will follow this sequence:

 

0. Defining the analysis parameters: Purpose, Objective and Timeline

As described in the Purpose and Objectives section.

 

1. Obtain input and hypotheses from the target audience

As described in the Purpose and Objectives section.

 

2. Obtain source data

About Twitter

Created in 2006, Twitter is an American micro-blogging and social networking service on which users post and interact with messages known as “Tweets”. Registered users can post, like, and Re-Tweet Tweets, but unregistered users can only read them. Twitter has approx. 186 Million daily active users who send about 500 Million Tweets each day (source: Twitter Earnings Press Release, 23rd July 2020).

The basics of Twitter

  • Each user has a profile (page) and can add a photo and information about themselves
  • Users can “follow” each other
  • Users can “Tweet”: publicly share a text message or a multimedia file such as a photo or a hyperlink to a video
  • Each Tweet is restricted to a maximum of 280 characters
  • Users can interact with a Tweet via comments (replies), likes, and shares (ReTweets)
  • Users can interact with other users via direct messaging
  • Users can create a thread: A series of connected Tweets
  • Users use hashtags (e.g., #loneliness) in order to associate their Tweets with certain topics and to make them easier to find
  • Users can search for keywords/hashtags in order to find relevant Tweets and users

Data extraction via the Twitter API was performed multiple times over a 6-week period by Bree McLennan between 23/06/2020 and 06/08/2020. The source (raw) dataset file format was comma delimited, “csv”.

The Twitter API data extraction query was configured with the following parameters:

  • Geographical coordinate centroid based in the CBD precinct of Geelong, Victoria, Australia
  • Range: 50 km radius (including Geelong, Winchelsea, Werribee, Surf Coast, Bellarine Peninsula, Bannockburn and Golden Plains Shire)
  • Extract Tweets created in the last 7 days (multiple runs required to capture daily representation over 6-week time period)
  • Search terms: “lonely” and “loneliness”. The query matches Tweets containing either of these terms.
  • Exclude Re-Tweets
  • Include all available public information about the Twitter users who made the Tweets matching the search criteria
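
The query parameters above can be sketched in R as follows. This is an illustration only: the centroid coordinates are approximate assumed values (the exact centroid is not stated here), and the commented-out call assumes the rtweet package and valid API credentials, which this document does not confirm were used.

```r
# Assumed (approximate) coordinates for the Geelong CBD centroid;
# the exact values used in the extraction are not stated.
geelong_lat <- -38.1499
geelong_lon <- 144.3617
radius_km   <- 50

# Match Tweets containing either search term
search_query <- paste(c("lonely", "loneliness"), collapse = " OR ")

# "latitude,longitude,radius" string for a radius-based search
geocode <- sprintf("%s,%s,%skm", geelong_lat, geelong_lon, radius_km)

# Hypothetical call, assuming the rtweet package and valid credentials
# (n = 1000 is an arbitrary illustrative limit):
# tweets <- rtweet::search_tweets(
#   q = search_query, n = 1000, geocode = geocode, include_rts = FALSE
# )
print(search_query)
print(geocode)
```

Because the search only reaches back about 7 days, a query like this would need to be re-run at least weekly to cover the full 6-week period.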

 

3. Obtain relevant research to assist in this analysis

In the context of exploring loneliness in the Geelong region, we have an opportunity to investigate the communication practices around expressions of loneliness in this social environment. By looking at the factors surrounding the expression of loneliness, we can gain a better understanding of the perceptions of loneliness.

We seek to better understand how people communicate about loneliness when they are addressing a network of friends and acquaintances and the types of expressions of loneliness which are communicated.

Exploring and analysing loneliness in the Geelong community is a subjective and complex endeavour. To help guide this analysis and align it with robust, peer-reviewed methods, it applies methods proposed and discussed in two white papers that specifically explore human loneliness in the social media and Twitter data context (Kivran-Swaine, F, et al, 2014; Ruiz, C, et al, 2017).

Applying this research will assist in:

  • Meaningfully labelling the unstructured Twitter text data
  • Unpacking the meaningful context in which loneliness is communicated
  • Providing more perspective on how we can analyse “lonely Tweets”
  • Highlighting contextual gaps which the research may not cover, revealing unique opportunities for exploration

 

4. Obtain relevant reference data to assist in this analysis

There are specific areas of this analysis which require additional reference data in order to interpret and extract meaning from the analysis results. These areas are:

  • Working with emojis and emoticons (Unicode data)
  • Measuring the “Russell Effect” in text

When working with emojis and emoticons in source text data, they usually need to be translated from whichever Unicode notation the source system used to encode them into something statistical software can interpret. In this analysis, the source data contained formal Unicode notation. The source data was transformed to one row per emoji per Tweet, and character string manipulation was applied to convert the emojis to simpler Unicode representations: one stripped of special characters with the suffix after “U” converted to lower case, and another with only the “< >” characters stripped. This made it possible to:

  1. Join the data to the official Unicode emoji definition list[6], and access the emoji categorisation labels
  2. Join the data to the Emoji Sentiment Ranking dataset[8], and access sentiment scores for each emoji
  3. Join the data to the Emoji Graphical Image Repository[7] to obtain image hyperlinks so the emojis can be visualised. This completely avoids the issue with fonts and encoding on Microsoft Windows devices.
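
A minimal base R sketch of the conversion step described above. The input rows and the column names `emoji_raw`, `emoji_code` and `emoji_key` are hypothetical, and the exact target formats are an interpretation of the description rather than the project's actual code.

```r
# Hypothetical input: one row per Tweet, emojis in formal Unicode notation
tweets <- data.frame(
  status_id     = c("x1", "x2"),
  TXTEmojiFound = c("<U+0001F643><U+0001F643>", "<U+2764>"),
  stringsAsFactors = FALSE
)

# Extract every <U+XXXX> token from each Tweet
tokens <- regmatches(
  tweets$TXTEmojiFound,
  gregexpr("<U\\+[0-9A-Fa-f]+>", tweets$TXTEmojiFound)
)

# Reshape to one row per emoji per Tweet
emoji_long <- data.frame(
  status_id = rep(tweets$status_id, lengths(tokens)),
  emoji_raw = unlist(tokens),
  stringsAsFactors = FALSE
)

# Representation 1: strip only the "<" and ">" characters
emoji_long$emoji_code <- gsub("[<>]", "", emoji_long$emoji_raw)

# Representation 2: strip all special characters and lower-case the
# hexadecimal suffix after "U"
emoji_long$emoji_key <- paste0(
  "U", tolower(gsub("[<>U+]", "", emoji_long$emoji_raw))
)

print(emoji_long)
```

Each representation can then serve as a join key against whichever reference table uses that notation.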

When measuring the Russell Effect, a lexicon is required which contains a list of common English words and numerical values describing the “valence”, “arousal” and “dominance” that each word represents.
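
As a sketch of how such a lexicon might be applied, the example below splits a Tweet into words and joins them to a tiny stand-in lexicon. The values for “think” are those shown in the text dataset output later in this report; the values for “lonely” are placeholders invented for illustration and are not real lexicon scores.

```r
# Stand-in lexicon. "think" uses the values shown in this report's text
# dataset; "lonely" uses made-up placeholder values.
lexicon <- data.frame(
  word      = c("think", "lonely"),
  Valence   = c(0.786, 0.100),
  Arousal   = c(0.408, 0.300),
  Dominance = c(0.618, 0.200),
  stringsAsFactors = FALSE
)

tweet_text <- "I think the ED is lonely"

# Split the Tweet into lower-case words
words <- unlist(strsplit(tolower(tweet_text), "[^a-z']+"))
words <- words[nzchar(words)]

# Left join: words missing from the lexicon keep NA scores
scored <- merge(
  data.frame(word = words, stringsAsFactors = FALSE),
  lexicon,
  by = "word", all.x = TRUE
)

# Average valence over the words that have a lexicon entry
mean_valence <- mean(scored$Valence, na.rm = TRUE)
print(scored)
print(mean_valence)
```

Words without a lexicon entry are kept with NA scores, which mirrors the NA values visible in the Valence, Arousal and Dominance columns of the text dataset.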

 

5. Technical approach to conduct this analysis

  • R Project File
    • Twitter data extraction. Executed once per week over the 6-week time frame.
    • Append the collection of raw Twitter data extract CSV files and create one combined CSV file for analysis
    • Manual “mechanical turk” activity to apply labels to the data for further analysis, and application of social behaviour research. Results saved as “.XLSX” and converted to “.CSV” upon completion. Load the combined and labelled Twitter CSV data for analysis.
    • Assess, prepare & clean data
    • Flag data integrity issues, drop non-relevant variables
    • Prepare primary analysis dataset
    • Prepare Twitter text dataset for text and sentiment analysis
    • Prepare Twitter text emoji usage dataset for emoji sentiment analysis
    • Prepare statistical summaries, observe and annotate results
  • Peer review analysis
  • Microsoft Power BI report
  • Prepare interactive visual statistical summaries
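
The “append the raw extract CSV files” step in the list above could look like the following base R sketch. The file names, folder and columns are hypothetical stand-ins created in a temporary directory purely so the example runs on its own.

```r
# Create two small stand-in weekly extracts (illustration only)
extract_dir <- tempfile("twitter_extracts")
dir.create(extract_dir)
write.csv(
  data.frame(status_id = c("x1", "x2"), text = c("a", "b")),
  file.path(extract_dir, "extract_week1.csv"), row.names = FALSE
)
write.csv(
  data.frame(status_id = c("x2", "x3"), text = c("b", "c")),
  file.path(extract_dir, "extract_week2.csv"), row.names = FALSE
)

# Read every weekly extract; reading all columns as character keeps
# long identifiers intact
csv_files <- list.files(
  extract_dir, pattern = "^extract_week.*\\.csv$", full.names = TRUE
)
extracts <- lapply(csv_files, read.csv, colClasses = "character")

# Stack the extracts and write one combined file for analysis
combined <- do.call(rbind, extracts)
write.csv(combined, file.path(tempdir(), "combined_extract.csv"),
          row.names = FALSE)
print(combined)
```

Overlapping extraction windows mean the same Tweet can appear in more than one file, so the combined file still needs de-duplication afterwards.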

 

6. Data considerations

  1. Information security & intellectual property
  2. Converting raw twitter data into useful dataset
  3. Filtering the analysis datasets (identifying “lonely” Tweets)
  4. Sampling versus Population and handling low sample size
  5. Applied Research, data dictionary and definitions for variables
  6. Manual inspection of raw twitter data, declaration of observations

 

7. Manual data exploration  

8. Automated data exploration  

9. Findings and Opportunities  

10. Next Steps  

11. References  

6. Data considerations

a. Information security & intellectual property

For the purpose of this analysis, the analysis datasets contain de-identified data. No names or contact details are used or included.

In the Tweet examples used throughout this analysis, where multimedia (such as digital photos, artwork, music, poetry) is shown, the copyright of these items belongs to the creators who created these materials.

Other important considerations when working with Twitter data:

  • Replicability and Black-Box Twitter: Real-time Twitter data collection is not reproducible, and for a given query there is no guarantee that extracting data via the free Twitter API will yield a truly random or accurate sample of Tweets. Access to all available Tweets incurs fees.

  • Data Privacy and Research Ethics: Tweets on public Twitter profiles are generally available. There are no measures in place that prevent the collection and analysis of the data, and users’ consent for the collection and processing of their Tweets and profile information is usually not required. However, Twitter’s terms of service are not necessarily congruous with data protection regulations in some countries. Ultimately, this leaves the ethical and legal questions of how to ensure data privacy to the researchers and analysts.

  • Uncertainty of Data Access: Data access is entirely dependent upon Twitter’s willingness to share the data, and therefore also on the jurisdictions by which Twitter must abide. Data access for research projects through Facebook’s and Instagram’s APIs has previously been shut down completely with only a few weeks’ notice. Research projects relying on Twitter data are therefore always risky, particularly those that depend on a constant influx of Twitter data over a long period of time (e.g., PhD projects).

  • Data Storage: Data storage can be an issue when many Tweets are collected over a long period of time. In many applications, data collections can easily amount to 100-200 GB per month. The use of powerful servers and storage in a relational database (e.g. SQL) are therefore recommended.

 

b. Converting raw twitter data into useful dataset

With the source dataset prepared, the following data clean-up steps were taken:

  • Removed Tweets with duplicate status IDs. This occurred due to overlap in the timing of data extraction via the Twitter API. The API allows extraction of the last 7 days of Tweets.
  • Cleaned text, including removal of special, non-ASCII and punctuation characters, and replacement of numbers and common symbols ($, %, &, @, w/) with English words.
  • Spell check, and acronym translation.
  • Dropped variables from the source set which personally identified the users.
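
The de-duplication and text-cleaning steps above can be sketched in base R as follows. The sample Tweets, the replacement words and the `clean_text` helper are assumptions made for illustration; number-to-word replacement, spell checking and acronym translation are not covered.

```r
# Hypothetical raw extract with one Tweet captured twice
raw_tweets <- data.frame(
  status_id = c("x1", "x2", "x2"),
  text = c("So lonely @ home w/ 100% effort & $5",
           "Feeling lonely...",
           "Feeling lonely..."),
  stringsAsFactors = FALSE
)

# Keep the first occurrence of each status ID
tweets <- raw_tweets[!duplicated(raw_tweets$status_id), ]

# Assumed symbol-to-word replacements; "w/" goes first so its slash is
# replaced before punctuation is stripped
replacements <- c("w/" = " with ", "$" = " dollar ", "%" = " percent ",
                  "&" = " and ", "@" = " at ")

clean_text <- function(x) {
  for (symbol in names(replacements)) {
    x <- gsub(symbol, replacements[[symbol]], x, fixed = TRUE)
  }
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII
  x <- gsub("[[:punct:]]", "", x)                        # drop punctuation
  gsub("\\s+", " ", trimws(x))                           # tidy whitespace
}

tweets$text_clean <- clean_text(tweets$text)
print(tweets$text_clean)
```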

 

Next, the data was manually reviewed and more features were engineered:

  • Identified and flagged Tweets containing quoted speech, song lyrics, poem lines, book titles, copyright material, podcasts, YouTube videos and memes.
  • Identified and flagged Tweets which used emojis or emoticons. The formal Unicode notation for these was contained within the text. The Tweets were flagged, and the Unicode was also stored as a new variable separate from the text for further specific analysis.
  • Identified and flagged Tweets from “famous people” or social influencers with more than 1000 followers
  • Derived the length of each user’s Twitter membership, in years elapsed since account creation
  • Acknowledging the COVID-19 pandemic and its significant, time-relevant impacts in Victoria, Australia: identified and flagged Tweets which mention COVID-19, lockdown measures such as isolation and quarantine, or the Victorian State Government’s regular public service and health announcements.
  • Identified and flagged whether each user represents a business entity, a social group or club, or an individual. The Twitter account and profile name were used for this.
  • Identified and flagged Tweets where the user references information which may reveal their current age group

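
Two of the engineered features above, the influencer flag and the years of Twitter membership, lend themselves to a short sketch. The first row uses values shown in the main analysis dataset output; the dates in the second row are hypothetical. Measuring elapsed years up to the Tweet's creation date is an assumption, as the reference date actually used is not stated.

```r
# Row 1 values appear in the main analysis dataset output;
# the dates in row 2 are hypothetical.
users <- data.frame(
  followers_count    = c(359L, 1543L),
  account_created_at = c("2014-06-07T12:22:00Z", "2015-03-01T09:00:00Z"),
  created_at         = c("2020-07-13T14:00:00Z", "2020-07-14T17:00:00Z"),
  stringsAsFactors = FALSE
)

# Parse ISO 8601 timestamps such as "2020-07-13T14:00:00Z" as UTC
parse_utc <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

# Flag accounts with more than 1000 followers
users$BINSocialInfluencer <- as.integer(users$followers_count > 1000)

# Whole years between account creation and the Tweet (assumed reference)
elapsed_days <- as.numeric(difftime(
  parse_utc(users$created_at),
  parse_utc(users$account_created_at),
  units = "days"
))
users$NUMYearsTwitterUser <- as.integer(floor(elapsed_days / 365.25))

print(users[, c("BINSocialInfluencer", "NUMYearsTwitterUser")])
```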
 

Using the text component of the Tweet data, the research features from (Kivran-Swaine, F, et al, 2014) were applied:

  • Loneliness Type:
    • Individual Loneliness: Occurs when a person is missing someone special or close to them, such as a spouse or friend with whom they had a close, emotional bond.
    • Social Loneliness: Refers to the absence of a social network made up of a wide group of friends, neighbours and colleagues.
    • Not Loneliness: When no mentions of “lonely” or “loneliness” are observed or the mentions of these words are outside of the scope and context of human loneliness and do not fit the above descriptions of individual and social loneliness.
  • Temporal Bounding: Reference to the time duration of the experience of loneliness.
    • Transient: The expression of loneliness included references to the experience being momentary, at present or potentially short-lived, such as “I’m so lonely right now”.
    • Enduring: The expression of loneliness was temporally framed in a way that suggested a long-lasting state, such as “I hate feeling like this. I’m so lonely and depressed all the time.”
    • Ambiguous: No mentions or references of any kind of temporal bounding.
  • Social Context: Indication of an environment in the context of the Tweet.
    • Online: Where another twitter user has been mentioned or there are references to virtual, internet-based interactions.
    • Offline: Where no twitter users have been mentioned and there are references to face-to-face interactions, in real life.
  • Contexts: How loneliness is expressed by the individual user.
    • Physical: Tangible, physical circumstances accompanying expressions of loneliness. These references can be indications of actual or aspired physical circumstances, as well as the specific conditions of these spaces (e.g., “I’m so lonely! Being in this big house by myself”). These Tweets contained mentions of geographical locations at micro and macrolevels (e.g. room, house, city, country).
    • Romantic: Past, present, or aspired romantic or sexual relationships, referenced together with the expression of loneliness. For example, in the following Tweet, the person describes actions that he/she frames as stemming from the absence of a romantic relationship (note that this Tweet also defines a physical context): “I’m so lonely that I sprayed cologne all over my room so it smells like I have a boyfriend and now I keep smelling my pillows ha help”
    • Somatic: Tweets that referred to users’ physical, mental or emotional state. It included references to the past, present, or aspired state of the users’ physical being (e.g. feeling nauseated, wishing to feel healthier, having a headache, losing sleep) and/or actions one takes towards one’s own body (e.g. taking medication). For example: “I’m so lonely right now lol nowhere near sleepy I been sleep all day finna take some medicine”.
  • Seeking Interaction (Ruiz, C, et al, 2017): User communicated an explicit desire for interaction, such as “I’m so lonely, somebody DM me!” and “Where are you, @anonymizeduser? I’m so lonely in this class!”

With the research applied as variables in the dataset to give the labelling better structure, some gaps emerged: not all Tweets could be completely described or labelled. More features were therefore created to provide labelling coverage:

  • Identified and flagged Tweets where the user describes championing or advocating the cause for raising specific awareness about loneliness, or offering tips, advice, services or support to their network of connections. For example, “Though often felt as a personal problem, it is in fact a shared societal challenge. 1 in 4 adults have no one with whom to share difficult news. #FFGALD20 #loneliness”.
  • Identified and flagged Tweets where the user intentionally uses “lonely” or “loneliness” as a projected weapon to harm, threaten, insult or react to another person or situation. For example, “@9NewsAUS If Dopey Dan had another brain, it would be lonely!!!”

 

The Source Data Variable List

# show the variables, not the values
ls(wrk_100_original_data)
##  [1] "account_created_at"                  "BINAdvocatingAwarenessForLoneliness"
##  [3] "BINCommInteraction"                  "BINCOVID"                           
##  [5] "BINPhysicalContext"                  "BINProjectToOther"                  
##  [7] "BINRomanticContext"                  "BINSocialInfluencer"                
##  [9] "BINSomaticContext"                   "BINTextContainsEmojis"              
## [11] "CATCopyRightMaterial"                "CATDerivedAgeGroup"                 
## [13] "CATLonelinessType"                   "CATSocialContext"                   
## [15] "CATTemporalBounding"                 "CATUserIndividualOrGroup"           
## [17] "country"                             "country_code"                       
## [19] "created_at"                          "description"                        
## [21] "display_text_width"                  "ext_media_expanded_url"             
## [23] "favourites_count"                    "followers_count"                    
## [25] "friends_count"                       "hashtags"                           
## [27] "listed_count"                        "location"                           
## [29] "media_url"                           "mentions_screen_name"               
## [31] "mentions_user_id"                    "name"                               
## [33] "NUMYearsTwitterUser"                 "place_full_name"                    
## [35] "place_name"                          "place_type"                         
## [37] "profile_expanded_url"                "quoted_created_at"                  
## [39] "quoted_description"                  "quoted_followers_count"             
## [41] "quoted_friends_count"                "quoted_location"                    
## [43] "quoted_name"                         "quoted_screen_name"                 
## [45] "quoted_source"                       "quoted_status_id"                   
## [47] "quoted_statuses_count"               "quoted_text"                        
## [49] "quoted_user_id"                      "reply_to_screen_name"               
## [51] "reply_to_status_id"                  "reply_to_user_id"                   
## [53] "retweet_count"                       "screen_name"                        
## [55] "source"                              "status_id"                          
## [57] "status_url"                          "statuses_count"                     
## [59] "text"                                "TXTEmojiFound"                      
## [61] "url"                                 "urls_expanded_url"                  
## [63] "user_id"                             "verified"

 

The Main Analysis Dataset Structure

### b.  Converting raw twitter data into useful dataset
glimpse(wrk_100_main_analysis)
## Rows: 1,063
## Columns: 45
## $ user_id                             <chr> "x2552464932", "x10850998393339...
## $ status_id                           <chr> "x1282676392178507776", "x12895...
## $ created_at                          <chr> "2020-07-13T14:00:00Z", "2020-0...
## $ screen_name                         <chr> "__coffeebean", "__MelBarrett",...
## $ text                                <chr> "@unicornmommy3 @JenniferLeahMD...
## $ source                              <chr> "Twitter for iPhone", "Twitter ...
## $ display_text_width                  <int> 202, 158, 245, 39, 236, 50, 221...
## $ NUMYearsTwitterUser                 <int> 6, 1, 5, 6, 7, 11, 4, 4, 8, 2, ...
## $ BINSocialInfluencer                 <int> 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0...
## $ CATUserIndividualOrGroup            <chr> "Individual", "Individual", "In...
## $ BINTextContainsEmojis               <int> 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0...
## $ TXTEmojiFound                       <chr> "", "<U+0001F643><U+0001F643>",...
## $ CATCopyRightMaterial                <chr> "None", "None", "None", "None",...
## $ CATLonelinessType                   <chr> "Social_Loneliness", "Social_Lo...
## $ CATTemporalBounding                 <chr> "Ambiguous", "Ambiguous", "Ambi...
## $ CATSocialContext                    <chr> "Online", "Offline", "Offline",...
## $ BINPhysicalContext                  <int> 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0...
## $ BINRomanticContext                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ BINSomaticContext                   <int> 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0...
## $ CATDerivedAgeGroup                  <chr> "Adult", "Adult", "Adult", "Adu...
## $ BINCommInteraction                  <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1...
## $ BINCOVID                            <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ BINAdvocatingAwarenessForLoneliness <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ BINProjectToOther                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ retweet_count                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ hashtags                            <chr> "", "", "", "", "", "", "", "",...
## $ place_full_name                     <chr> "", "", "", "", "Melbourne, Vic...
## $ location                            <chr> "Melbourne, Victoria", "Melbour...
## $ followers_count                     <int> 359, 191, 1543, 100, 2829, 3197...
## $ friends_count                       <int> 1074, 286, 390, 193, 1802, 484,...
## $ listed_count                        <int> 6, 2, 31, 1, 28, 15, 0, 0, 9, 1...
## $ statuses_count                      <int> 4490, 1894, 29544, 1622, 22844,...
## $ favourites_count                    <int> 5273, 5078, 44575, 1524, 16142,...
## $ urls_expanded_url                   <chr> "", "", "", "", "", "", "", "",...
## $ account_created_at                  <chr> "2014-06-07T12:22:00Z", "2019-0...
## $ status_url                          <chr> "https://twitter.com/__coffeebe...
## $ CATUserDerivedLocation              <chr> "InnerMelbourne", "InnerMelbour...
## $ CATTweetCreatedDay                  <chr> "Mon", "Sat", "Tue", "Thu", "Fr...
## $ TIMTweetCreatedTime                 <int> 14, 12, 17, 2, 20, 13, 2, 11, 8...
## $ long                                <dbl> 144.9515, 144.9515, 144.9515, 1...
## $ lat                                 <dbl> -37.81701, -37.81701, -37.81701...
## $ CATAnalysis_1                       <chr> "ambiguous_social_loneliness", ...
## $ CATAnalysis_2                       <chr> "_physical__somatic_", "_somati...
## $ CATAnalysis_3                       <chr> "_NoAltContext_", "_NoAltContex...
## $ CATAnalysis_4                       <chr> "ambiguous_social_loneliness_ph...
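
The combined CATAnalysis labels are not defined in the text. Judging only from the values visible in the output above, CATAnalysis_1 and CATAnalysis_2 look like concatenations of the research labels. The sketch below reproduces the visible values under that assumption; it is a guess at the construction, not the project's confirmed method, and the “_romantic_” token and the second row are hypothetical.

```r
# Row 1 mirrors the first record in the output above; row 2 is hypothetical
labels <- data.frame(
  CATTemporalBounding = c("Ambiguous", "Ambiguous"),
  CATLonelinessType   = c("Social_Loneliness", "Social_Loneliness"),
  BINPhysicalContext  = c(1L, 0L),
  BINRomanticContext  = c(0L, 0L),
  BINSomaticContext   = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Assumed: temporal bounding plus loneliness type, lower-cased
labels$CATAnalysis_1 <- tolower(
  paste(labels$CATTemporalBounding, labels$CATLonelinessType, sep = "_")
)

# Assumed: one token per context flag that is set, joined end to end
labels$CATAnalysis_2 <- paste0(
  ifelse(labels$BINPhysicalContext == 1, "_physical_", ""),
  ifelse(labels$BINRomanticContext == 1, "_romantic_", ""),
  ifelse(labels$BINSomaticContext == 1, "_somatic_", "")
)

print(labels[, c("CATAnalysis_1", "CATAnalysis_2")])
```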

 

The Emoji Dataset Structure

glimpse(wrk.100_emojifreqsummary)
## Rows: 212
## Columns: 52
## $ status_id                     <chr> "x1275738318072774657", "x12761258166...
## $ NUMEmoji_animalbird           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_animalbug            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_animalmammal         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_animalmarine         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_arrow                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_bodyparts            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_bookpaper            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_catface              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_clothing             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_countryflag          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_drink                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_emotion              <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ NUMEmoji_event                <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ NUMEmoji_faceaffection        <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0...
## $ NUMEmoji_faceconcerned        <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_facecostume          <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0...
## $ NUMEmoji_facehand             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_facenegative         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ NUMEmoji_faceneutralskeptical <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_facesleepy           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_faceslepy            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_facesmiling          <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_faceunwell           <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_family               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_foodfruit            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_game                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_handfingersclosed    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_handfingerspartial   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_hands                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_handsinglefinger     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_lightvideo           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_money                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_music                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_No_Mapping_Available <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_objectsclothing      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_office               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_othersymbol          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_person               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_personactivity       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_personfantasy        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_persongesture        <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_personrole           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_phone                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_placebuilding        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_placemap             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_plantflower          <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_punctuation          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_singlehandfinger     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMEmoji_skyweather           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ NUMEmoji_sport                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NUMTotalEmojiInTweet          <int> 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1...
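
A table of this shape, one row per Tweet with a count per emoji category, can be built from long-format emoji data. The base R sketch below uses hypothetical input rows and an assumed `category` column; it shows one possible approach, not necessarily the one used in the project.

```r
# Hypothetical long-format input: one row per emoji per Tweet
emoji_long <- data.frame(
  status_id = c("x1", "x1", "x2"),
  category  = c("faceconcerned", "facesmiling", "skyweather"),
  stringsAsFactors = FALSE
)

# Cross-tabulate Tweets against categories, then convert to a data frame
counts <- as.data.frame.matrix(
  table(emoji_long$status_id, emoji_long$category)
)
names(counts) <- paste0("NUMEmoji_", names(counts))

# Total number of emojis in each Tweet
counts$NUMTotalEmojiInTweet <- rowSums(counts)

# Move the Tweet identifier from the row names into a column
emoji_wide <- data.frame(
  status_id = rownames(counts), counts, stringsAsFactors = FALSE
)
rownames(emoji_wide) <- NULL
print(emoji_wide)
```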

 

The Text Dataset Structure

glimpse(wrk_LNG_text_analysis)
## Rows: 26,663
## Columns: 19
## $ status_id                        <chr> "x1282676392178507776", "x12826763...
## $ text                             <chr> "@unicornmommy3 @JenniferLeahMD @d...
## $ NUMSyuzhetTweetSentiment         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ NUMNRTweetSentiment_anger        <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ NUMNRTweetSentiment_anticipation <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ NUMNRTweetSentiment_disgust      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ NUMNRTweetSentiment_fear         <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ NUMNRTweetSentiment_joy          <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ NUMNRTweetSentiment_sadness      <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ NUMNRTweetSentiment_surprise     <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ NUMNRTweetSentiment_trust        <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ NUMNRTweetSentiment_negative     <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ NUMNRTweetSentiment_positive     <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ word                             <chr> "i", "think", "the", "ed", "is", "...
## $ Valence                          <dbl> NA, 0.786, NA, NA, NA, NA, 0.906, ...
## $ Arousal                          <dbl> NA, 0.408, NA, NA, NA, NA, 0.337, ...
## $ Dominance                        <dbl> NA, 0.618, NA, NA, NA, NA, 0.620, ...
## $ word_stemmed                     <chr> "i", "think", "the", "ed", "is", "...
## $ stem_completed_word              <chr> "i", "think", "the", "ed", "is", "...

c. Filtering the Analysis Datasets (Identifying “lonely” Tweets)

The source dataset contains all extracted Tweets which met the search criteria, that is, Tweets containing “lonely” or “loneliness” in the text. The search query also matched and retrieved Tweets from users whose user names contain “lonely” or “loneliness”. In these circumstances the variable CATLonelinessType can be used to filter the main analysis dataset and exclude Tweets flagged as “NotLoneliness”. Applying this filter reduces the observation count to 1063 Tweets.

 

library(dplyr)
`%ni%` <- Negate(`%in%`)  # "not in" helper; not provided by base R or dplyr
wrk_100_main_analysis <- wrk_100_main_analysis %>%
  dplyr::filter(CATLonelinessType %ni% c("NotLoneliness"))
# 1063 observations

 

d. Sampling versus Population and handling low sample size

The analysis datasets contain 1063 observations (Tweets). This collection of data represents a subset of a much larger population of Tweets and human users. We’re only looking at Tweets matching specific search criteria, so we must take care when analysing and interpreting results. A pattern emerging in this set of data is not necessarily representative of the wider Twitter data population or of the real, human community at large, and we must be careful in drawing conclusions. However, we can generate ideas and new thinking by observing the data in this set.

It’s important to remember that Twitter users do not represent a random sample from a given population. This is due not only to the presence of bots and company or institutional accounts, but also to the manifold self-selection processes that using Twitter entails:

Population → Internet Users → Twitter Users → Active Twitter Users → Users sharing geo-information

To summarise this sample size challenge in context of Australian population data and social media usage:

 

e. Applied Research, data dictionary and definitions for variables

For this analysis we will use the variables described in section 6b to group the data and apply descriptive statistics, comparison analysis, visualisations and unsupervised machine learning algorithms.

Variable features which were created for the analysis datasets use a three-letter acronym prefix to denote the expected general data type values:

  • NUM: Numeric values which have a range beyond binary format, may include NA (Missing/Null).
  • BIN: Binary values. 1 and 0 only.
  • CAT: Categorical value. Structured and consistent groupings
  • TXT: Free/unstructured text.
  • TIM: Time values. Specifically formatted as HH:MM:SS.ss
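
The prefix convention above makes it straightforward to build a simple data dictionary from column names alone. A minimal sketch, where the chosen variable names come from this report's datasets and the “source variable” fallback label is an assumption:

```r
# A few variable names taken from the datasets in this report
variables <- c("NUMYearsTwitterUser", "BINCOVID", "CATLonelinessType",
               "TXTEmojiFound", "TIMTweetCreatedTime", "status_id")

# Expected general data type for each three-letter prefix
prefix_meaning <- c(NUM = "numeric", BIN = "binary", CAT = "categorical",
                    TXT = "free text", TIM = "time")

prefix <- substr(variables, 1, 3)

# Unprefixed names are treated as untouched source variables (assumed label)
data_dictionary <- data.frame(
  variable      = variables,
  expected_type = ifelse(prefix %in% names(prefix_meaning),
                         unname(prefix_meaning[prefix]),
                         "source variable"),
  stringsAsFactors = FALSE
)
print(data_dictionary)
```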

 

f. Manual inspection of raw twitter data, declaration of observations

When manually exploring and labelling the source data, some patterns in the data were observed. This section is dedicated to declaring these observations.

Data examples of the contexts of loneliness:

  1. Doing something constructive about loneliness, the optimists
  2. Yearning for and enjoying companionship with pets, animals or wildlife
  3. Giving support to others who are experiencing loneliness
  4. Experiencing loneliness with the willpower, motivation and drive to get through it: self-efficacy
  5. Advertising services, forums, seminars, podcasts and programs
  6. Searching for the opportunity within one’s self. The challenge of life and the self-learning

Advocating for those who don’t have a voice or are unable to speak out about their experiences of loneliness:

  1. Aged Care, the elderly population and the vulnerable
  2. Children and youth attending school
  3. People with chronic illness, impairment and disability

People describing going through life challenging events:

  1. Sickness, disease, treatments, rehabilitation
  2. Whistle-blowing on corruption, such as Aged Care royal commission and COVID-19 hotel quarantine.
  3. Leaving abusive relationships
  4. Acceptance of one’s personal state
  5. Struggling with one’s personal state, an unwillingness to be alone or to sit in the emotion which loneliness can bring

People who are alone but not lonely. In this analysis, this is labelled as “Not Loneliness”.

Observations and projection of loneliness, without first person ownership of the state:

  1. “look at this lonely guy”
  2. “they look lonely”
  3. “loneliness in melbourne”
  4. “thats a lonely place”
  5. “its been lonely”

Using loneliness as a projected threat:

  1. “you must be lonely”
  2. “i hope they like loneliness”

Metaphors and media about loneliness, describing detached conversations about loneliness, detached expressions, or things which evoke feelings of loneliness:

  1. Books: “A Lonely Girl is a Dangerous Thing” by Jessie Tu
  2. Songs: Sgt. Pepper’s Lonely Hearts Club Band, an Album by The Beatles
  3. News Articles
  4. Research articles, journals, inventions and technology
  5. Pictures, visual art, posters, memes
  6. Webinars
  7. Blogs
  8. Writer’s festival and author writing groups
  9. Seminars, forums, panel discussions

COVID-19 impacts:

  1. Physical isolation, lockdown and quarantine
  2. Prolonged social disconnection
  3. Aggravation of pre-existing mental diseases, illness and disorders
  4. Anxiety and uncertainty of the future. Safety and security instability.

Loneliness described with experiences of boredom

Sex industry and advertisement of services

 

 

7. Manual data exploration

This section is dedicated to exploring the data and key guiding questions to discover actionable insights.

 

Main Analysis Dataset

 

Summary Statistics for the Main Analysis Dataset

##    user_id           status_id          created_at        screen_name       
##  Length:1063        Length:1063        Length:1063        Length:1063       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      text              source          display_text_width NUMYearsTwitterUser
##  Length:1063        Length:1063        Min.   :  6.0      Min.   : 0.000     
##  Class :character   Class :character   1st Qu.: 75.0      1st Qu.: 2.000     
##  Mode  :character   Mode  :character   Median :140.0      Median : 6.000     
##                                        Mean   :150.4      Mean   : 5.537     
##                                        3rd Qu.:232.0      3rd Qu.: 9.000     
##                                        Max.   :301.0      Max.   :13.000     
##                                                                              
##  BINSocialInfluencer CATUserIndividualOrGroup BINTextContainsEmojis
##  Min.   :0.0000      Length:1063              Min.   :0.000        
##  1st Qu.:0.0000      Class :character         1st Qu.:0.000        
##  Median :0.0000      Mode  :character         Median :0.000        
##  Mean   :0.3104                               Mean   :0.175        
##  3rd Qu.:1.0000                               3rd Qu.:0.000        
##  Max.   :1.0000                               Max.   :1.000        
##                                                                    
##  TXTEmojiFound      CATCopyRightMaterial CATLonelinessType  CATTemporalBounding
##  Length:1063        Length:1063          Length:1063        Length:1063        
##  Class :character   Class :character     Class :character   Class :character   
##  Mode  :character   Mode  :character     Mode  :character   Mode  :character   
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  CATSocialContext   BINPhysicalContext BINRomanticContext BINSomaticContext
##  Length:1063        Min.   :0.0000     Min.   :0.0000     Min.   :0.0000   
##  Class :character   1st Qu.:0.0000     1st Qu.:0.0000     1st Qu.:0.0000   
##  Mode  :character   Median :0.0000     Median :0.0000     Median :0.0000   
##                     Mean   :0.1505     Mean   :0.1505     Mean   :0.4911   
##                     3rd Qu.:0.0000     3rd Qu.:0.0000     3rd Qu.:1.0000   
##                     Max.   :1.0000     Max.   :1.0000     Max.   :1.0000   
##                                                                            
##  CATDerivedAgeGroup BINCommInteraction    BINCOVID     
##  Length:1063        Min.   :0.0000     Min.   :0.0000  
##  Class :character   1st Qu.:0.0000     1st Qu.:0.0000  
##  Mode  :character   Median :1.0000     Median :0.0000  
##                     Mean   :0.5616     Mean   :0.3076  
##                     3rd Qu.:1.0000     3rd Qu.:1.0000  
##                     Max.   :1.0000     Max.   :1.0000  
##                                                        
##  BINAdvocatingAwarenessForLoneliness BINProjectToOther retweet_count     
##  Min.   :0.0000                      Min.   :0.0000    Min.   :  0.0000  
##  1st Qu.:0.0000                      1st Qu.:0.0000    1st Qu.:  0.0000  
##  Median :0.0000                      Median :0.0000    Median :  0.0000  
##  Mean   :0.2832                      Mean   :0.1656    Mean   :  0.9981  
##  3rd Qu.:1.0000                      3rd Qu.:0.0000    3rd Qu.:  0.0000  
##  Max.   :1.0000                      Max.   :1.0000    Max.   :449.0000  
##                                                                          
##    hashtags         place_full_name      location         followers_count 
##  Length:1063        Length:1063        Length:1063        Min.   :     0  
##  Class :character   Class :character   Class :character   1st Qu.:   123  
##  Mode  :character   Mode  :character   Mode  :character   Median :   384  
##                                                           Mean   :  2612  
##                                                           3rd Qu.:  1448  
##                                                           Max.   :200594  
##                                                           NA's   :1       
##  friends_count       listed_count     statuses_count   favourites_count
##  Min.   :     0.0   Min.   :   0.00   Min.   :     1   Min.   :     0  
##  1st Qu.:   197.2   1st Qu.:   1.00   1st Qu.:  1170   1st Qu.:  1093  
##  Median :   520.0   Median :   4.00   Median :  4604   Median :  5567  
##  Mean   :  1131.6   Mean   :  40.56   Mean   : 15455   Mean   : 16332  
##  3rd Qu.:  1112.8   3rd Qu.:  22.00   3rd Qu.: 16197   3rd Qu.: 18841  
##  Max.   :135394.0   Max.   :3511.00   Max.   :510189   Max.   :331637  
##  NA's   :1          NA's   :1         NA's   :1        NA's   :1       
##  urls_expanded_url  account_created_at  status_url       
##  Length:1063        Length:1063        Length:1063       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  CATUserDerivedLocation CATTweetCreatedDay TIMTweetCreatedTime      long      
##  Length:1063            Length:1063        Min.   : 0.000      Min.   :144.4  
##  Class :character       Class :character   1st Qu.: 4.000      1st Qu.:145.0  
##  Mode  :character       Mode  :character   Median : 8.000      Median :145.0  
##                                            Mean   : 9.292      Mean   :144.9  
##                                            3rd Qu.:13.000      3rd Qu.:145.0  
##                                            Max.   :23.000      Max.   :145.1  
##                                                                               
##       lat         CATAnalysis_1      CATAnalysis_2      CATAnalysis_3     
##  Min.   :-38.15   Length:1063        Length:1063        Length:1063       
##  1st Qu.:-37.82   Class :character   Class :character   Class :character  
##  Median :-37.82   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :-37.83                                                           
##  3rd Qu.:-37.82                                                           
##  Max.   :-37.74                                                           
##                                                                           
##  CATAnalysis_4     
##  Length:1063       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

 

How Many Twitter Users are Represented, by Location?

Pinpointing exactly where a Tweet geographically originated can be technically tricky. What we can generally say from this view is that most of the Tweets originated “close” to the Geelong region, within roughly a 100km radius.

Tweet Location     User Type    Total Users   Total Tweets   Average Years Twitter User
EastMetro          Individual             6              7                         4.57
GeelongBellarine   Individual            17             20                         5.05
InnerMelbourne     Business              18             22                         5.91
InnerMelbourne     Group                 21             78                         4.91
InnerMelbourne     Individual           756            905                         5.62
NorthMetro         Business               1              3                         4
NorthMetro         Individual             5              6                         7
SouthEastMetro     Individual             5              6                         6.17
Unknown            Individual            10             10                         3.1
WestMetro          Individual             6              6                         5
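A summary of this shape can be sketched in base R as follows. The data frame and the helper summarise_group() are hypothetical, and the values are made up. The sketch averages years on Twitter over Tweets rather than over distinct users; that appears consistent with the table above (for example 4.57 for a group with 7 Tweets), but it is an inference rather than something the analysis states.

```r
# Toy records, one row per Tweet (illustrative values only).
tweets <- data.frame(
  user_id = c("u1", "u1", "u2", "u3", "u3", "u4"),
  CATUserDerivedLocation = c("InnerMelbourne", "InnerMelbourne", "InnerMelbourne",
                             "GeelongBellarine", "GeelongBellarine", "GeelongBellarine"),
  NUMYearsTwitterUser = c(4, 4, 8, 2, 2, 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise one location group.
summarise_group <- function(d) {
  data.frame(
    Location    = d$CATUserDerivedLocation[1],
    TotalUsers  = length(unique(d$user_id)),   # distinct users
    TotalTweets = nrow(d),                     # one row per Tweet
    AvgYears    = round(mean(d$NUMYearsTwitterUser), 2)  # averaged over Tweets
  )
}

# Split by location, summarise each piece, and stack the results.
by_location <- do.call(
  rbind,
  lapply(split(tweets, tweets$CATUserDerivedLocation), summarise_group)
)
print(by_location)
```

Because the mean is taken over Tweets, a user who Tweets often carries more weight in the average than a user who Tweets once.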

 

Top Twitter Users by Number of Tweets About Loneliness

Using the “screen_name” variable we can tally up the distinct Twitter users represented in the dataset.

It’s encouraging to see “@friendsfor_good”, “@EndLonelinessAU” and “@Dr_FaithG” on this list. These are Twitter users who advocate for raising awareness of loneliness.

User Total Tweets
@friendsfor_good 37
@shaunrowland11 11
@DigitalMehmet 8
@EndLonelinessAU 5
@barbaraneves 4
@Dr_FaithG 4
@fakenotears 4
@myronmy9 4
@Nashtopia 4
@pixelsmixel 4

 

Timeline of Tweets

The data capture window of 6 weeks is short. Even so, in a year when the COVID-19 pandemic hit Victoria especially hard, it is notable that the number of Tweets about loneliness spikes around the first week of July and the first week of August. These time points coincide with Victorian State Government announcements of COVID-19 public safety and lock-down measures. At the early-August announcement, Melbourne was placed on stage-4 restrictions, while regional Victoria was placed on stage-3.

 

Tweets at Times Throughout the Day

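This density plot shows that the highest volume of Tweets was sent between 4AM and 10AM, with the lowest volume at 5PM and a small increase at 10PM.

This is interesting to observe, but without fuller context it may not be very informative. In particular, the “created_at” timestamps shown in the tables in this section carry a “Z” (UTC) suffix; if the hour of day was taken from them without conversion to Melbourne local time (UTC+10 during these months), a 4AM to 10AM peak would correspond to 2PM to 8PM locally.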
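Whether the hour of day is read in UTC or in local time changes the interpretation considerably. The following is a minimal base R sketch, not the method used in this analysis, showing how "created_at" strings in the format seen in the tables here could be expressed in Melbourne local time. It assumes the "Australia/Melbourne" time zone name is available on the system running it.

```r
# created_at values in the format shown in the tables (ISO 8601, UTC).
created_at <- c("2020-07-31T13:23:00Z", "2020-07-08T00:30:00Z", "2020-08-04T00:59:00Z")

# Parse the strings as UTC instants.
utc_time <- as.POSIXct(created_at, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Express the same instants as an hour of day in UTC and in Melbourne local time.
hour_utc   <- as.integer(format(utc_time, "%H", tz = "UTC"))
hour_local <- as.integer(format(utc_time, "%H", tz = "Australia/Melbourne"))

print(data.frame(created_at, hour_utc, hour_local))
```

The same approach applies to the day of week: formatting with "%A" and the local time zone gives the local weekday, which can differ from the UTC weekday for Tweets sent late in the UTC day.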

 

Tweets Throughout the Week

This bar plot shows that Wednesdays and Thursdays were the most popular days for Tweets about loneliness. It is curious to observe that the weekend days were the least popular.

 

Locations and Geographical Origins of Tweets

Most Twitter users turn off their location in their privacy settings, but those who leave it on add valuable location information to their Tweets.

By using “place_full_name” supplemented with “location”, it was possible to manually allocate geographical metropolitan and regional areas to generally categorise Tweet locations.

In the table below, we can observe that “InnerMelbourne” represents the majority of Tweet locations, followed by “GeelongBellarine”. It is possible that “GeelongBellarine” is under-represented here, as users may opt to display the nearest capital city (Melbourne) as their location rather than precisely where they are. It is equally possible that mobile phones, laptops and personal computers use the nearest cell tower or internet service provider network server to give a general geographical position at the time of Tweet creation.

In the map below, the dark green circle represents a 50km radius around the Geelong CBD, the geographical parameter used in the Twitter API data extraction. The number of Tweets per general Tweet location is overlaid. The groups “Unknown” and “GeelongBellarine” have been combined in this view.
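To relate the plotted group positions to the 50km circle, the great-circle distance from the Geelong CBD can be computed directly. The sketch below is an illustration in base R with a hypothetical helper, haversine_km(), and an assumed mean Earth radius of 6371 km. It uses the GeelongBellarine and InnerMelbourne coordinates from the plotting code in this section. These are manually chosen label positions for each group, so they do not locate individual Tweets.

```r
# Great-circle (haversine) distance in km between points given in decimal degrees.
# radius_km = 6371 is an assumed mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Geelong CBD, the centre of the 50km extraction radius.
geelong_lat  <- -38.148743
geelong_long <- 144.365739

# Group label positions taken from the plotting code in this section.
centroids <- data.frame(
  location = c("GeelongBellarine", "InnerMelbourne"),
  lat      = c(-38.148743, -37.817007),
  long     = c(144.365739, 144.951542)
)

centroids$km_from_geelong <- haversine_km(geelong_lat, geelong_long,
                                          centroids$lat, centroids$long)
centroids$within_50km <- centroids$km_from_geelong <= 50
print(centroids)
```

The InnerMelbourne label position sits a little over 60 km from the Geelong CBD, so its marker falls outside the drawn circle even though the Tweets grouped under it were returned by the 50km query.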

#Tweet Locations

# Packages used by the summary and map code below
library(dplyr)
library(leaflet)

#Data Setup
# Location patterns used in grepl() for CATUserDerivedLocation in wrk_100_main_analysis.
loc_unknown <- c("#purpose: shelter my soul", "any pronouns", "she/her")
loc_eastmetro <- c("Blackburn", "Box Hill", "Abbotsford", "Hawthorn", "Ringwood", "Surrey Hills")
loc_northmetro <- c("Brunswick", "Coburg", "Epping", "Northcote")
loc_westmetro <- c("3018", "Footscray", "Melton", "Point Cook","Werribee")
loc_southeastmetro <- c("Clayton", "Elwood", "Murrumbeena", "Frankston","Mornington Peninsula")
loc_geelongbellarine <- c("Geelong", "Lara", "Barwon Heads", "Drysdale","Clifton Springs","Portarlington","Ocean Grove")
loc_regionalvic <- c("Colac")
loc_surfcoast <- c("Angelsea")
loc_innermelb <- c("mel", "Melbourne", "MELBOURNE", "Albert Park", "Carlton", "Docklands","Prahan","Somewhere in Melbourne","South Yarra")

DerivedLocations <- c("Unknown", "EastMetro","NorthMetro", "WestMetro","SouthEastMetro", "GeelongBellarine", "RegionalVic", "SurfCoast", "InnerMelbourne" )
lat <- c(-38.148743,-37.813680, -37.741317, -37.859042, -37.905374, -38.148743, -38.342728, -38.408793, -37.817007)
long <- c(144.365739,145.089686, 144.966709, 144.796066, 145.125195, 144.365739, 143.584951, 144.162215, 144.951542)
geocoordslocation <- data.frame(DerivedLocations,long,lat ) # This set is left joined to wrk_100_main_analysis by DerivedLocations = CATUserDerivedLocation


wrk_100_main_analysis %>% 
  filter(!is.na(CATUserDerivedLocation)) %>% 
  count(CATUserDerivedLocation, sort = TRUE) 
#Geographical Plot of user locations:
# Setting up summary dataset for geographical plotting:
# Combine the Unknown location with Geelong, else when we plot it, it will overlap.
wrk.geoplot <- wrk_100_main_analysis %>%
  mutate(CATUserDerivedLocation = ifelse(CATUserDerivedLocation == "Unknown", "GeelongBellarine", CATUserDerivedLocation)) %>%
  group_by(long, lat, CATUserDerivedLocation) %>%
  summarise(NUMUsers = n_distinct(user_id),
            NUMTweets = n()) %>%
  ungroup() # drop the remaining grouping so later steps work on a plain data frame
# Create groups to plot
userlocations.df <- split(wrk.geoplot, wrk.geoplot$CATUserDerivedLocation)

pal <- colorFactor(c("purple", "navy", "turquoise", "orange", "gold", "green"),
                   domain = c("InnerMelbourne", "GeelongBellarine", "NorthMetro", "EastMetro", "SouthEastMetro", "WestMetro"))

#Define the leaflet object
#leaflet(wrk.geoplot) %>% addTiles() %>% addMarkers(
#  clusterOptions = markerClusterOptions()
#)


# Base map: the tiles and the 50km extraction radius are added once, outside the per-group loop.
l <- leaflet() %>%
  addTiles() %>%
  addProviderTiles(provider = "CartoDB.Positron") %>%
  addCircles(lng = 144.3657, lat = -38.14874, radius = 50000, color = "green", fill = FALSE)

# Add one Tweet-count label and one circle marker layer per derived location group.
for (grp in names(userlocations.df)) {
  l <- l %>%
    addLabelOnlyMarkers(data = userlocations.df[[grp]], lng = ~long, lat = ~lat,
                        label = ~as.character(NUMTweets),
                        labelOptions = labelOptions(noHide = TRUE, direction = "center", textOnly = TRUE, textsize = "20px",
                                                    style = list(
                                                      "color" = "black",
                                                      "font-family" = "arial",
                                                      "font-weight" = "bold"))) %>% # bold is a font-weight, not a font-style
    addCircleMarkers(data = userlocations.df[[grp]],
                     radius = 45,
                     color = ~pal(CATUserDerivedLocation),
                     lng = ~long, lat = ~lat,
                     stroke = FALSE,
                     fillOpacity = 0.5,
                     label = ~as.character(CATUserDerivedLocation),
                     popup = ~as.character(NUMTweets),
                     group = grp)
}
# Create UI control layer for the map plots
l %>%
  addLayersControl(
    overlayGroups = names(userlocations.df),
    options = layersControlOptions(collapsed = FALSE)
  )

 

 

How Much Text Content Per Tweet?

We can inspect “display_text_width” to observe how many text characters are used per Tweet.

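Twitter limits a Tweet to 280 characters. The median for this dataset is 140 characters per Tweet, roughly one sentence. The summary statistics above report a maximum “display_text_width” of 301, so this variable evidently does not count characters in exactly the way the limit is applied.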

 

 

What Devices are Used to Communicate Tweets?

We can inspect “source” to see which applications Tweets are sent from.

The list reveals many device, linkage and platform options for Tweeting.

At the top of the list, most users Tweet from an iPhone, a web browser (laptop/PC) or an Android smartphone.

Tweet Sources                    Total Users   Total Tweets
Twitter for iPhone                       370            440
Twitter Web App                          219            284
Twitter for Android                      181            220
Twitter for iPad                          23             24
Hootsuite Inc.                            13             17
TweetDeck                                 11             14
Instagram                                 10             12
Buffer                                     5              9
LinkedIn                                   3              5
IFTTT                                      3              4
dlvr.it                                    2              3
Sprout Social                              2              3
Tweetbot for iOS                           3              3
Fenix 2                                    1              2
Goodreads                                  2              2
How You Really Feel                        1              2
Microsoft Power Platform                   2              2
Tweetbot for Mac                           1              2
Echobox                                    1              1
Falcon Social Media Management             1              1
Flamingo for Android                       1              1
Google                                     1              1
Grabyo                                     1              1
HubSpot                                    1              1
Lightful                                   1              1
Missinglettr                               1              1
Ripl App                                   1              1
Scoop.it                                   1              1
SEMrush Social Media Tool                  1              1
SocialBee.io v2                            1              1
Streamlabs Twitter                         1              1
Twitter for Mac                            1              1
WordPress.com                              1              1

 

The Most Re-Tweeted Tweets

Even though the Twitter API search query was configured not to extract re-Tweets, we have access to ReTweet counts in the data.

The Twitter API helpfully returns a “retweet_count” variable whose values can easily be sorted. Here we sort all the Tweets in descending order by the size of the “retweet_count”.
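The sort described above can be sketched in base R as follows, using a small made-up data frame (the screen names and counts are illustrative, not taken from the analysis data).

```r
# Toy data: one row per Tweet with its retweet count (illustrative values only).
tweets <- data.frame(
  screen_name   = c("user_a", "user_b", "user_c", "user_d"),
  retweet_count = c(3, 51, 0, 12),
  stringsAsFactors = FALSE
)

# Order rows by retweet_count, largest first, and keep the top 2.
top_retweets <- head(tweets[order(-tweets$retweet_count), ], 2)
print(top_retweets)
```

In the summary statistics above, "retweet_count" has a median of 0 and a maximum of 449, so a top-N listing like this describes a small number of unusual Tweets rather than typical behaviour.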

Created User ReTweet Count Text
2020-07-31T13:23:00Z trashcanprince 449 even an evil manifestation gets lonely sometimes #neiboltreddie #reddie https://t.co/UDamH44Oak
2020-07-30T09:37:00Z luciemorrismarr 51 This is my neighbour. He’s 85, a widower and lives alone with his dog. He finds life lonely but today he said through his window he feels relief at his life choices. “I’m not in aged care, I’m alive,” he says. #COVID19Vic https://t.co/HMnj2ycNak
2020-07-08T00:30:00Z ReadingsBooks 48 Okay! Because we know it can be loneliness inducing thinking about the next six weeks, here are six things we can all do at home to feel part of the bookish community and deal with that feeling of isolation …
2020-07-28T21:44:00Z MazinB_ 42 Always pick being alone rather than settle for someone that makes you feel lonely
2020-07-30T05:12:00Z barbaraneves 24 When we wrote this piece on experiences of lonely older Australians we couldn’t anticipate how things would be now: so much worse! Our participants are reporting suicidal ideation and complete despair, because they’re not only lonely, they feel disposable & blamed. Ageism kills! https://t.co/0UoWG3zpo5
2020-07-31T01:31:00Z ItsSpoopsB 23 Scoops feeling a bit lonely https://t.co/kpovubxElJ
2020-07-07T10:01:00Z philipdalidakis 19 Please take care out there & look out for each other. Another lock down will be challenging to many. Financial pressures, loneliness/isolation, family violence. But you’re not alone. Call Lifeline 131144 or Beyond Blue 1300 22 4636 24 hours a day. We will only beat this together!
2020-08-02T09:53:00Z DrEricLevi 13 Dear Melbourne, You can be in a crowd and feel lonely. You can be alone and not feel lonely. Connect well during this season. We can be physically distant but be socially and emotionally connected.
2020-07-14T07:28:00Z MartinFoleyMP 13 Victorians struggling with loneliness will now be able to receive support from @RedCrossAU and community organisations thanks to the $6 million Community Activation and Social Isolation initiative, funded through our $59.4 million mental health and wellbeing package #springst https://t.co/wHh9wswdtN
2020-07-16T01:44:00Z NormanHermant 12 At #agedcareRC devastating evidence from 91 yr old Beryl Hawkins via telephone. She lives on her own in a Housing Commission flat in #Sydney. She experienced loneliness & depression, ‘sitting for hrs not being able to talk to anyone.’ Also terrible #dental problems. -Thread- https://t.co/EVpLvSRPNk

 

It is possible to extract a visual screen-shot of the re-Tweeted Tweets using tweet_screenshot() from the tweetrmd package. Just provide the “screen_name” and “status_id”.

 

Re-Tweeted Tweets About Loneliness

The following set of Tweet screenshots are from the top 10 re-Tweeted Tweets about loneliness.

 

Tweets with the Most Likes

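To find the most liked Tweets, we sort the Tweets in descending order and print the rows with the highest counts. The per-Tweet like count in Twitter API data is “favorite_count”. The table below, however, is headed “Favourites Count”, and its top value (331637) equals the maximum of the user-level “favourites_count” variable in the summary statistics above, which counts the Tweets an account has liked rather than the likes a Tweet has received. The rows below should be read with that distinction in mind.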

By observing the number of re-Tweets and likes, we can begin to assess the social “reach” that Tweets about loneliness may have on the social media community and network of associated people.

Created User Favourites Count Text
2020-08-01T21:49:00Z DamienEvans7 331637 @mrsnb16 @PatsKarvelas The loneliness of the long distance runner - Iron Maiden
2020-08-04T00:59:00Z ewster 321567 Gaaaaah they be buzzing about again, can hear the bickering, and the lonely guy keeps zooming up and down Elizabeth St. He needs a date. Tough going under lockdown.
2020-07-07T05:52:00Z RV_27 313584 @Origsmartassam You look a bit lonely Sam
2020-07-30T05:17:00Z Wolfie_Rankin 276877 @LugubriousLarry Thankyou kindly. I’m grateful for those who are online with me, but in real life I’m quite lonely. I miss my family terribly. This old house used to be a very busy place, people would randomly drop in, the phone was always going. And now most of the time there’s silence.
2020-07-17T10:53:00Z gerster_kaylene 276638 @Gary_Hardgrave @EndlessEcho121 Sadly these stories of lonely grief will last longer than virus…lots of pent up anger will cause many problems in the future..
2020-07-07T07:08:00Z gerster_kaylene 275671 @barrelracernt @bouta_nt Bored and lonely…
2020-08-05T09:55:00Z stofsk 254978 @damoj Maybe I’m just sad and lonely but that wouldn’t be a turn off for me
2020-07-09T08:33:00Z andrewfx_51 211098 @Asher_Wolf @Sunsplashsun Serious pros and cons to the idea. A lot of classicists/fundamentalists do it because you can devote greater time to classical languages. For someone who was “gifted” the more important skills I learned were not academic, although it was very lonely also
2020-07-16T03:00:00Z missannaklein 197815 @BMiyakee @lenacarti Lost in an image, in a dream But there’s no one there to wake her up - the world is spinning But tell me what happens when it stops? They go And they say She’s so lucky, she’s a star But she cry, cry, cries in her lonely heart, thinking If there’s nothing missing in my life…
2020-07-31T09:41:00Z mareefeb 172230 @greyham65 @chelsea_hetho no one dies before their time the majority of the deaths have been to people already dying of other complaints if covid really was the cause of death who can tell?…it is the lonely isolated deaths that is the problem & anyone who thinks that is okay is just cruel

 

Most Liked Tweets About Loneliness

The following set of Tweet screenshots are from the top 10 most liked Tweets about loneliness.

 

 

The Most Frequently Used Hashtags

Helpfully, the source Twitter data contains a separate variable which records the hashtags used in each Tweet. This makes it easy to summarise and count them.

On inspection, a majority of the top 10 hashtags relate to “FFGALD20”, which was described in the previous section as the “Friends for Good” and “The Australian Loneliness Dialogue 2020”. The remainder of the list relate to COVID-19 and news media coverage.

One challenge with hashtags is that nothing validates a hashtag or governs its usage; they are a social media phenomenon of their own. Many hashtags may exist that are worded or spelled differently yet relate to exactly the same thing. For example, in the list below we see “loneliness FFGALD20”, “Loneliness FFGALD20” (note the upper case L) and “FFGALD20 loneliness”, and for the topic of COVID we see “COVID19Vic”, “COVID19Vic melbournelockdown” and “CovidVic”; within each set, these relate to the same thing but are spelled and structured differently. One simple treatment is to set all hashtags to lower case, but this does not solve the problem of many different wordings meaning the same thing in context.
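The lower-casing treatment, extended to ignore the order of hashtags within a Tweet, can be sketched in base R. The helper normalise_hashtags() is hypothetical, and the sketch assumes each value holds a Tweet's hashtags separated by spaces, as they appear in the table in this section.

```r
# Hashtag strings in the form shown in the table: one string per Tweet,
# with multiple hashtags separated by a space.
hashtags <- c("loneliness FFGALD20", "Loneliness FFGALD20", "FFGALD20 loneliness",
              "FFGALD20", "COVID19Vic", "CovidVic")

# Hypothetical helper: lower-case, split into individual tags, sort and re-join,
# so that differences in case and ordering collapse to a single key.
normalise_hashtags <- function(x) {
  tags <- strsplit(tolower(x), " +")
  vapply(tags, function(t) paste(sort(unique(t)), collapse = " "), character(1))
}

hashtag_key <- normalise_hashtags(hashtags)
print(sort(table(hashtag_key), decreasing = TRUE))
```

The three FFGALD20 and loneliness variants collapse to one key, while "COVID19Vic" and "CovidVic" remain distinct. That is the limitation noted above: different wordings for the same topic would need a manually maintained mapping or some other form of grouping.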

There is certainly opportunity to explore this further and, in fact, an opportunity to consider creating a hashtag for #LonelinessInGeelong, publicising it and getting it “trending” on social media.

Hashtags Hashtag Count
FFGALD20 10
loneliness FFGALD20 6
loneliness 4
Loneliness FFGALD20 4
COVID19Vic 3
FFGALD20 loneliness 3
9News 2
COVID19Vic melbournelockdown 2
CovidVic 2
loneliness digitalage 2

 

Which Twitter Users are Mentioned the Most?

Here we tokenise the text of each Tweet and use str_detect() from the stringr package to keep only the words that start with an @.

In this list we observe the Premier of the State of Victoria, Dan Andrews, as well as Friends for Good, ABC Melbourne and The Age news outlet.
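The tokenise-and-filter step can be illustrated with a small self-contained sketch. It swaps in base R's grepl() for str_detect() so that it runs without extra packages, and the Tweet texts and user names are made up for the example.

```r
# Toy Tweet texts (illustrative only).
tweet_text <- c(
  "@user_one @user_two thanks for raising this",
  "Feeling lonely today, thank you @user_one",
  "No mentions in this one"
)

# Tokenise on whitespace, then keep only tokens that start with "@".
tokens   <- unlist(strsplit(tweet_text, "\\s+"))
mentions <- tokens[grepl("^@", tokens)]

# Strip trailing punctuation so that "@user_one," and "@user_one" are counted together.
mentions <- sub("[^A-Za-z0-9_]+$", "", mentions)

# Tally mentions, most frequent first.
mention_counts <- sort(table(mentions), decreasing = TRUE)
print(mention_counts)
```

Splitting on whitespace keeps the "@" attached to the user name, which matters because some tokenisers strip punctuation by default and would lose it.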

Mentions User Name Mentions Count
@DanielAndrewsMP 14
@rpatulny 6
@dariusdevas 5
@theage 5
@friendsforgood 4
@JessTu2 4
@MarleeBower 4
@abcmelbourne 3
@AcadSocSciences 3
@bairdjulia 3

 

Screenshots of Tweets Containing Users Who are Frequently Mentioned

The Tweet screen-shot below is an example of a response to a new book which had been released in Victoria. A new author named Jessie Tu (see top mentions: @JessTu2) has written and published a book titled “A Lonely Girl is a Dangerous Thing”. This book not only contains “lonely” in the title, a key word in this analysis, but it is actually a book which includes the topic of loneliness and evokes a strong response from within the book reading community. There are numerous Tweets in the analysis data which reference this book title and the topics it covers.

 

Summary of Research Variables

Linking back to Section 6b where features were created to enhance the analysis datasets, the variables which were inspired by the works of Kivran-Swaine, F, et al (2014) and Ruiz, C, et al (2017) will be used to form categorical groups which can be used in the subsequent emoji and text analysis.

The primary analysis grouping used in subsequent sections is a concatenation of “Temporal Bounding” (CATTemporalBounding) and “Loneliness Type” (CATLonelinessType). This gives 6 categorical groups, a manageable number to commence analysis with. In contrast, when this grouping is combined with the Loneliness Context variables (BINPhysicalContext, BINRomanticContext, BINSomaticContext), Communication Interaction (BINCommInteraction), COVID-19 flag (BINCOVID), Advocating Awareness for Loneliness (BINAdvocatingAwarenessForLoneliness) and Projecting Loneliness (BINProjectToOther), the cardinality rises to 57, which is too many categories for most statistical summaries to visualise neatly.
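How concatenating label variables drives up cardinality can be shown with a small base R sketch. The rows are made up, only two of the binary flags are included, and the tag strings appended for each flag are illustrative rather than the exact labels used in the analysis.

```r
# Toy labelled Tweets (illustrative values only).
labels <- data.frame(
  CATTemporalBounding = c("ambiguous", "enduring", "ambiguous", "transient", "ambiguous"),
  CATLonelinessType   = c("social_loneliness", "social_loneliness", "individual_loneliness",
                          "social_loneliness", "social_loneliness"),
  BINSomaticContext   = c(0, 1, 1, 0, 1),
  BINCommInteraction  = c(1, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Primary analysis group: temporal bounding concatenated with loneliness type.
labels$primary_group <- paste(labels$CATTemporalBounding, labels$CATLonelinessType, sep = "_")

# Fuller descriptor: append a tag for each binary flag that is set.
labels$full_descriptor <- paste0(
  labels$primary_group,
  ifelse(labels$BINSomaticContext == 1, "_somatic", ""),
  ifelse(labels$BINCommInteraction == 1, "_seekinginteraction", "")
)

# Cardinality is the number of distinct combinations actually observed.
n_primary <- length(unique(labels$primary_group))
n_full    <- length(unique(labels$full_descriptor))
print(c(primary = n_primary, full = n_full))
```

Each additional flag can at most double the number of combinations, so a descriptor built from several flags quickly produces many groups that each contain only a handful of Tweets.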

The tables below describe the combined analysis groups and show the cardinality (unique combinations).

Temporal Bounding and Loneliness Type Total Users Total Tweets
ambiguous_social_loneliness 484 631
enduring_social_loneliness 228 251
transient_social_loneliness 64 68
enduring_individual_loneliness 48 52
ambiguous_individual_loneliness 47 49
transient_individual_loneliness 12 12

Loneliness Context Total Users Total Tweets
seekinginteraction 221 272
NoContext 168 228
_somatic__seekinginteraction_ 135 148
somatic 127 138
_physical__somatic_seekinginteraction 64 70
_romantic__somatic_seekinginteraction 53 56
_romantic__somatic_ 34 37
_physical__somatic_ 31 32
_physical__romantic__somatic__seekinginteraction_ 28 29
_romantic__seekinginteraction_ 11 13
_physical__romantic_somatic 11 12
romantic 11 11
_physical__seekinginteraction_ 8 8
physical 7 7
_physical__romantic_ 1 1
_physical__romantic_seekinginteraction 1 1

Loneliness Alternative Context Total Users Total Tweets
NoAltContext 397 442
covid 143 157
lonelinessadvocate 87 148
_covid__lonelinessadvocate_ 105 140
projectingloneliness 125 135
_covid__projectingloneliness_ 26 28
_lonelinessadvocate__projectingloneliness_ 10 11
_covid__lonelinessadvocate_projectingloneliness 2 2

Full Descriptor Total Users Total Tweets
ambiguous_social_loneliness_seekinginteraction_ 206 254
ambiguous_social_loneliness 158 217
enduring_social_loneliness_somatic_ 63 69
enduring_social_loneliness_somatic_seekinginteraction 63 66
ambiguous_social_loneliness_somatic_seekinginteraction 47 54
enduring_social_loneliness_physical__somatic__seekinginteraction_ 46 50
ambiguous_social_loneliness_somatic_ 43 45
transient_social_loneliness_somatic_seekinginteraction 21 23
enduring_social_loneliness_physical_somatic 21 22
transient_social_loneliness_somatic_ 20 22
enduring_individual_loneliness_romantic__somatic__seekinginteraction_ 18 19
ambiguous_individual_loneliness_romantic__somatic__seekinginteraction_ 15 16
ambiguous_social_loneliness_physical__somatic__seekinginteraction_ 14 16
enduring_individual_loneliness_physical__romantic__somatic_seekinginteraction 10 11
enduring_individual_loneliness_romantic_somatic 10 11
enduring_social_loneliness_romantic__somatic__seekinginteraction_ 11 11
enduring_social_loneliness_physical__romantic__somatic_seekinginteraction 10 10
ambiguous_social_loneliness_romantic_somatic 8 9
ambiguous_individual_loneliness_romantic_somatic 8 8
ambiguous_social_loneliness_physical_seekinginteraction 8 8
enduring_social_loneliness_seekinginteraction_ 8 8
ambiguous_individual_loneliness_romantic_seekinginteraction 7 7
ambiguous_social_loneliness_physical_ 7 7
ambiguous_social_loneliness_physical_somatic 7 7
enduring_social_loneliness_physical__romantic__somatic_ 6 7
ambiguous_individual_loneliness_romantic_ 6 6
enduring_social_loneliness 5 6
transient_individual_loneliness_romantic_somatic 6 6
ambiguous_individual_loneliness_seekinginteraction_ 4 5
ambiguous_social_loneliness_romantic_seekinginteraction 3 5
transient_social_loneliness_romantic__somatic__seekinginteraction_ 5 5
ambiguous_individual_loneliness_somatic_seekinginteraction 4 4
ambiguous_social_loneliness_physical__romantic__somatic_seekinginteraction 4 4
transient_social_loneliness 4 4
transient_social_loneliness_seekinginteraction_ 4 4
ambiguous_social_loneliness_romantic__somatic__seekinginteraction_ 2 3
enduring_individual_loneliness_physical__romantic__somatic_ 3 3
transient_social_loneliness_physical__romantic__somatic_seekinginteraction 3 3
transient_social_loneliness_physical_somatic 3 3
transient_social_loneliness_physical__somatic__seekinginteraction_ 3 3
ambiguous_individual_loneliness_physical__romantic__somatic_ 2 2
enduring_individual_loneliness_romantic_ 2 2
enduring_social_loneliness_romantic_somatic 2 2
transient_individual_loneliness_romantic_ 2 2
transient_individual_loneliness_romantic__somatic__seekinginteraction_ 2 2
ambiguous_individual_loneliness 1 1
ambiguous_social_loneliness_physical__romantic__seekinginteraction_ 1 1
ambiguous_social_loneliness_romantic_ 1 1
enduring_individual_loneliness_physical_romantic 1 1
enduring_individual_loneliness_physical__somatic__seekinginteraction_ 1 1
enduring_individual_loneliness_romantic_seekinginteraction 1 1
enduring_individual_loneliness_seekinginteraction_ 1 1
enduring_individual_loneliness_somatic_ 1 1
enduring_individual_loneliness_somatic_seekinginteraction 1 1
transient_individual_loneliness_physical__romantic__somatic_seekinginteraction 1 1
transient_individual_loneliness_somatic_ 1 1
transient_social_loneliness_romantic_somatic 1 1

 

Summary of Created Features

The tables below show the summary statistics for the features created in the main analysis dataset. This analysis does not cover these variables in further detail.

Derived Age Group Total Users Total Tweets
Adult 814 1010
Youth_Student 29 32
Children 11 21

Social Influence Total Users Total Tweets
Low Influence 590 733
High Influence > 1000 followers 255 330

Copyright Material Type Referenced Total Users Total Tweets
None 682 807
Other 86 148
Quote 51 57
SongLyric 34 37
FilmTVTitle 14 14

Social Context Total Users Total Tweets
Online 558 700
Offline 315 363

 

Emoji Analysis Dataset

 

Most Frequent Emojis by Analysis Groups

The series of bar charts below illustrates the most common emojis used and the Unicode block each emoji belongs to.

The Unicode block offers an alternative category grouping to further describe the emojis.

To keep this in context, Tweets containing emojis make up nearly 20% of the analysis dataset: approximately 1 out of every 5 Tweets contains one or more emojis.

A simple observation: a diverse set of emojis is represented across the 6 analysis groups.

 

 

Heatmap of Emojis by Analysis Groups

Another method of representing the diversity and coverage of emojis used in Tweets about loneliness is to visualise the data as a heatmap.

Immediately, we can observe that the “Ambiguous Social Loneliness” group has the most diverse representation of emojis, followed by the “Enduring Social Loneliness” group. In contrast, “Transient Individual Loneliness” appears under-represented and also has a small sample size.

 

 

Emojis by Positive vs. Negative Sentiment and Analysis Groups

Sentiment analysis is a type of text mining which aims to determine the opinion and subjectivity of its content. When applied to Twitter data, the results can be representative of the Tweet created as well as the user’s individual influences. We can extend this same idea to represent emojis and emoticons.

Using the Emoji Sentiment Ranking v1.0 dataset (viewed 6 September 2020)[9], we can map the emojis in the analysis data to their sentiment scores and measure them here.

In the boxplots below, the emoji sentiment polarity ranges from negative 1 for negative sentiment to positive 1 for positive sentiment. As a general observation, the median sentiment score across all analysis groups ranges between 0.2 and 0.5, which is slightly positive. A curious thought: is it possible that optimists experiencing loneliness are more likely to use emojis and emoticons to punctuate their Tweets? This could be an idea to explore later.

 

 

 

Text Analysis Dataset

 

Summary Statistics for the Text Analysis Dataset

Summary statistics are shown below for “wrk_100_text_analysis”, which contains one row per Tweet with cleaned text and pre-processing applied, followed by the summary statistics for “wrk_LNG_text_analysis”, which contains one row per word token per Tweet along with cleaned and stemmed words ready for analysis.

##   status_id             text            text_clean       
##  Length:1063        Length:1063        Length:1063       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  NUMSyuzhetTweetSentiment NUMNRTweetSentiment_anger
##  Min.   :-18.000          Min.   :0.000            
##  1st Qu.: -3.000          1st Qu.:1.000            
##  Median : -2.000          Median :1.000            
##  Mean   : -1.583          Mean   :1.195            
##  3rd Qu.:  1.000          3rd Qu.:2.000            
##  Max.   : 10.000          Max.   :6.000            
##  NUMNRTweetSentiment_anticipation NUMNRTweetSentiment_disgust
##  Min.   :0.000                    Min.   :0.000              
##  1st Qu.:0.000                    1st Qu.:1.000              
##  Median :0.000                    Median :1.000              
##  Mean   :0.587                    Mean   :1.101              
##  3rd Qu.:1.000                    3rd Qu.:1.000              
##  Max.   :5.000                    Max.   :5.000              
##  NUMNRTweetSentiment_fear NUMNRTweetSentiment_joy NUMNRTweetSentiment_sadness
##  Min.   :0.00             Min.   :0.0000          Min.   :0.000              
##  1st Qu.:1.00             1st Qu.:0.0000          1st Qu.:1.000              
##  Median :1.00             Median :0.0000          Median :1.000              
##  Mean   :1.59             Mean   :0.5268          Mean   :1.698              
##  3rd Qu.:2.00             3rd Qu.:1.0000          3rd Qu.:2.000              
##  Max.   :6.00             Max.   :6.0000          Max.   :7.000              
##  NUMNRTweetSentiment_surprise NUMNRTweetSentiment_trust
##  Min.   :0.0000               Min.   :0.0000           
##  1st Qu.:0.0000               1st Qu.:0.0000           
##  Median :0.0000               Median :0.0000           
##  Mean   :0.2879               Mean   :0.6312           
##  3rd Qu.:0.0000               3rd Qu.:1.0000           
##  Max.   :3.0000               Max.   :5.0000           
##  NUMNRTweetSentiment_negative NUMNRTweetSentiment_positive
##  Min.   :0.000                Min.   :0.000               
##  1st Qu.:1.000                1st Qu.:0.000               
##  Median :2.000                Median :1.000               
##  Mean   :2.022                Mean   :1.113               
##  3rd Qu.:3.000                3rd Qu.:2.000               
##  Max.   :7.000                Max.   :9.000
##   status_id             text           NUMSyuzhetTweetSentiment
##  Length:26663       Length:26663       Min.   :-18.000         
##  Class :character   Class :character   1st Qu.: -4.000         
##  Mode  :character   Mode  :character   Median : -2.000         
##                                        Mean   : -1.562         
##                                        3rd Qu.:  1.000         
##                                        Max.   : 10.000         
##                                                                
##  NUMNRTweetSentiment_anger NUMNRTweetSentiment_anticipation
##  Min.   :0.000             Min.   :0.0000                  
##  1st Qu.:1.000             1st Qu.:0.0000                  
##  Median :1.000             Median :1.0000                  
##  Mean   :1.321             Mean   :0.8057                  
##  3rd Qu.:2.000             3rd Qu.:1.0000                  
##  Max.   :6.000             Max.   :5.0000                  
##                                                            
##  NUMNRTweetSentiment_disgust NUMNRTweetSentiment_fear NUMNRTweetSentiment_joy
##  Min.   :0.000               Min.   :0.000            Min.   :0.0000         
##  1st Qu.:1.000               1st Qu.:1.000            1st Qu.:0.0000         
##  Median :1.000               Median :2.000            Median :0.0000         
##  Mean   :1.199               Mean   :1.796            Mean   :0.7298         
##  3rd Qu.:2.000               3rd Qu.:2.000            3rd Qu.:1.0000         
##  Max.   :5.000               Max.   :6.000            Max.   :6.0000         
##                                                                              
##  NUMNRTweetSentiment_sadness NUMNRTweetSentiment_surprise
##  Min.   :0.000               Min.   :0.0000              
##  1st Qu.:1.000               1st Qu.:0.0000              
##  Median :2.000               Median :0.0000              
##  Mean   :1.928               Mean   :0.3907              
##  3rd Qu.:2.000               3rd Qu.:1.0000              
##  Max.   :7.000               Max.   :3.0000              
##                                                          
##  NUMNRTweetSentiment_trust NUMNRTweetSentiment_negative
##  Min.   :0.0000            Min.   :0.00                
##  1st Qu.:0.0000            1st Qu.:1.00                
##  Median :1.0000            Median :2.00                
##  Mean   :0.8725            Mean   :2.35                
##  3rd Qu.:1.0000            3rd Qu.:3.00                
##  Max.   :5.0000            Max.   :7.00                
##                                                        
##  NUMNRTweetSentiment_positive     word              Valence     
##  Min.   :0.000                Length:26663       Min.   :0.000  
##  1st Qu.:0.000                Class :character   1st Qu.:0.438  
##  Median :1.000                Mode  :character   Median :0.646  
##  Mean   :1.519                                   Mean   :0.590  
##  3rd Qu.:2.000                                   3rd Qu.:0.757  
##  Max.   :9.000                                   Max.   :1.000  
##                                                  NA's   :17064  
##     Arousal        Dominance     word_stemmed       stem_completed_word
##  Min.   :0.073   Min.   :0.045   Length:26663       Length:26663       
##  1st Qu.:0.310   1st Qu.:0.409   Class :character   Class :character   
##  Median :0.407   Median :0.524   Mode  :character   Mode  :character   
##  Mean   :0.439   Mean   :0.515                                         
##  3rd Qu.:0.548   3rd Qu.:0.618                                         
##  Max.   :0.971   Max.   :0.991                                         
##  NA's   :17064   NA's   :17064

 

Procedure for Creating Word Tokens

When preparing the text analysis datasets, the following steps were applied to the Tweet text:

  1. Use the qdap package to automatically scan for text issues and clean up incomplete sentences, brackets, punctuation, blank spaces and other text anomalies. qdap is also used to convert numbers, contractions, abbreviations and symbols into full words. This gives us more precise word data to analyse.

  2. Use the tm package to convert all text to lower case. This helps to reduce the sparsity of the text analysis.

  3. Manually identify any “undesirable words” in the context of the Tweet analysis. In this case we need to carefully consider whether to keep “lonely” and “loneliness”, as they may add noise to some analysis components such as unsupervised machine learning and clustering.

  4. Remove “stop words”, such as “a”, “he”, “she”, “they” and “I”, as these clutter the text analysis in the search for meaningful verbs and nouns.

  5. Create individual word tokens by splitting each Tweet into individual words, using the tidytext package.

  6. Left join, by “word”, the numerical values for word valence, arousal and dominance.

  7. Apply stemming to the word tokens. We want to reduce the sparsity of the individual words, again to de-clutter the text analysis in the search for meaningful words. The SnowballC package was used for stemming.
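
The steps above can be sketched in base R. This is a minimal illustration only, not the code used for this analysis: the two Tweets, the stop-word list and the valence values are made up, and the qdap clean-up (step 1) and SnowballC stemming (step 7) are left out so that the example runs without extra packages.

```r
# Toy inputs: every value below is an illustrative assumption.
tweets <- data.frame(
  status_id = c("t1", "t2"),
  text = c("Feeling lonely at home, I miss my friends",
           "She is lonely and the days are hard"),
  stringsAsFactors = FALSE
)
stop_words <- c("a", "he", "she", "they", "i", "at", "my",
                "is", "and", "the", "are")
undesirable_words <- c("lonely", "loneliness")
valence_lexicon <- data.frame(
  word = c("friends", "home", "hard"),
  Valence = c(0.9, 0.8, 0.3),
  stringsAsFactors = FALSE
)

# Step 2: convert to lower case, then strip punctuation.
clean_text <- gsub("[[:punct:]]", "", tolower(tweets$text))

# Step 5: split each Tweet into word tokens, one row per word per Tweet.
token_list <- strsplit(clean_text, "\\s+")
tokens <- data.frame(
  status_id = rep(tweets$status_id, lengths(token_list)),
  word = unlist(token_list),
  stringsAsFactors = FALSE
)

# Steps 3 and 4: drop the stop words and the undesirable words.
tokens <- tokens[!tokens$word %in% c(stop_words, undesirable_words), ]

# Step 6: left join the valence scores by "word"; unmatched words get NA.
tokens <- merge(tokens, valence_lexicon, by = "word", all.x = TRUE)
tokens <- tokens[order(tokens$status_id, tokens$word), ]
print(tokens)
```

Words with no match in the lexicon keep an NA valence, which is the same pattern as the NA counts reported for Valence, Arousal and Dominance in the summary statistics above.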

 

Notes on Sentiment Analysis and NLP

Sentiment analysis is a type of text mining which aims to determine the opinion and subjectivity of its content. When applied to Twitter data, the results can be representative of the Tweet created as well as the user’s individual influences.

Natural Language Processing (NLP) is another methodology used in mining text. It tries to decipher the ambiguities in written language by tokenization, clustering, extracting entity and word relationships, and using algorithms to identify themes and quantify subjective information.

At this point, there are some important questions to consider before commencing with sentiment analysis and machine learning:

  1. Is further data preparation required?
  2. Are the stemmed words appropriate to use in further analysis or do we need to use the raw words?
  3. How will we approach negation in sentence constructs? A sentence such as “I am not happy and I don’t like it” can change its sentiment and meaning very quickly when we remove key words like “not” and “don’t”. Negation is a very important consideration for this text analysis.
  4. Advanced concept in sentiment analysis: Is it appropriate to simply replace some certain words with more frequently used synonyms (semantically similar peers) and/or hypernyms (common parents)? This would be used to address lexicon word matching challenges, between the words in the text versus the lexicon used.
  5. Do we need to construct our own lexicon for the sentiment analysis, or will off-the-shelf packaged lexicons be appropriate to use?
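
To make the negation question (item 3) concrete, the sketch below scores one sentence in two ways. It is a hedged illustration only: the polarity lexicon and stop-word list are made up, the contraction “don’t” is assumed to have already been expanded to “do not”, and the rule of flipping the word directly after “not” is just one simple way to handle negation.

```r
# Made-up example sentence and a tiny, made-up polarity lexicon.
sentence <- "i am not happy and i do not like it"
stop_words <- c("i", "am", "not", "and", "do", "it")
polarity <- c(happy = 1, like = 1, sad = -1)

words <- strsplit(sentence, " ")[[1]]

# Naive approach: remove stop words (including "not"), then sum word scores.
kept <- words[!words %in% stop_words]
naive_score <- sum(polarity[kept], na.rm = TRUE)

# Negation-aware approach: look up every word, treat unknown words as 0,
# and flip the score of any word that directly follows "not".
scores <- polarity[words]
scores[is.na(scores)] <- 0
negated <- c(FALSE, head(words, -1) == "not")
aware_score <- sum(ifelse(negated, -scores, scores))

print(c(naive = naive_score, negation_aware = aware_score))
```

The naive score comes out positive while the negation-aware score comes out negative, which is the risk described in item 3.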

 

Most Frequent Words (Stop-Words Removed)

To extract the words from the text of each Tweet we use several functions from the tidytext package. First we remove ampersand, greater-than and less-than characters, URLs and emoji from the text. Then we tokenise the text into a one-row-per-word format, filter out stop words such as “the”, “of” and “to”, translate any numbers, and filter out hashtags and mentions of usernames. Finally we select the variables of interest, count the frequency of each word and sort in descending order.
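
As a rough sketch of this cleaning and counting sequence, the base R example below uses regular expressions in place of the tidytext functions. The two Tweets, the URL and the short stop-word list are made up, and digits are simply dropped rather than translated into words.

```r
# Two made-up Tweets containing the kinds of residue described above.
tweets <- c(
  "Feeling lonely &amp; sad in lockdown https://example.com/abc @someone #covid",
  "People feel lonely in lockdown, people need friends"
)
stop_words <- c("in", "the", "of", "to")

clean <- gsub("&amp;|&lt;|&gt;", " ", tweets)  # escaped ampersand, < and >
clean <- gsub("https?://\\S+", " ", clean)      # URLs
clean <- gsub("[@#]\\w+", " ", clean)           # mentions and hashtags
clean <- gsub("[^[:alpha:][:space:]]", " ", clean)  # punctuation, digits, emoji
clean <- tolower(clean)

# Tokenise into words, filter out stop words, then count and sort.
words <- unlist(strsplit(trimws(clean), "\\s+"))
words <- words[!words %in% stop_words]
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 5))
```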

We can visualise the raw word counts using word clouds. The intention here is to get an initial, basic view of the common words used in Tweets across the entire analysis dataset.

For this section the wordcloud and wordcloud2 packages will be used to create the word clouds for the dataset.

Word from Tweet Total Word Count
people 143
feel 136
friends 75
covid 68
time 62
feeling 61
lockdown 60
life 57
sad 57
health 51
mental 48
social 44
isolation 43
hard 40
love 39
dont 38
day 35
home 35
family 34
ffgald 31

 

On initial observation of this word list, “people”, “feel” and “friends” paint a very communal picture of the nature of many of the Tweets.

 

Lexical Diversity (Vocabulary)

We can explore the vocabulary used in Tweets; we will refer to this as “lexical diversity”.

A curious, subjective question at this point: could a larger vocabulary in a Tweet (and potentially therefore of the user) be an indicator of great storytelling, of evoking an emotional response in followers, and of prompting replies? This could be an opportunity for subsequent analysis.

In calculating the lexical diversity the pre-requisites are to:

  • Remove stop words
  • Work with a dataset which is one row per word (un-nested, token by word)
  • Group by Analysis Groups
  • Count by distinct words used in each Tweet
  • Visualise using a boxplot
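
A minimal base R sketch of these steps is shown here. The token rows are made up, stop words are assumed to have been removed already, and the group labels are only borrowed from this analysis for readability.

```r
# Made-up token data: one row per word per Tweet, stop words already removed.
tokens <- data.frame(
  analysis_group = c(rep("Enduring Social Loneliness", 4),
                     rep("Transient Social Loneliness", 3)),
  status_id = c("t1", "t1", "t1", "t2", "t3", "t3", "t3"),
  word = c("friends", "home", "friends", "sad", "time", "life", "hard"),
  stringsAsFactors = FALSE
)

# Count the distinct words used in each Tweet, keeping the analysis group.
distinct_per_tweet <- aggregate(
  word ~ analysis_group + status_id,
  data = tokens,
  FUN = function(w) length(unique(w))
)
names(distinct_per_tweet)[3] <- "distinct_words"

# Average distinct word count per analysis group.
group_means <- tapply(distinct_per_tweet$distinct_words,
                      distinct_per_tweet$analysis_group, mean)
print(group_means)

# A boxplot by group could then be drawn with, for example:
# boxplot(distinct_words ~ analysis_group, data = distinct_per_tweet)
```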

The table below summarises the vocabulary count for each Tweet, aligned to its analysis group. The table is interactive and searchable.

 

 

 

In this visualisation, each dot represents an individual Tweet, plotted by its total count of distinct (unique) words used. The black horizontal lines mark the average count of distinct words used for each analysis group.

On observation, the “Enduring Social Loneliness” group has a marginally higher lexical diversity than the other five groups, closely followed by “Ambiguous Social Loneliness” and “Enduring Individual Loneliness”.

 

 

Wordcloud for All Tweets

We can visualise the stemmed word counts using word clouds. The intention here is to get an initial, basic view of the common words used in Tweets. By using the stemmed words instead of the raw word counts we reduce the number of distinct words to observe. For example, stemming reduces both “people” and “peoples” to “peopl”, so in a wordcloud the number of visualised words drops from two to one, de-cluttering the visual.

We do need to be careful with the usage and context of these visualisations because some Tweets repeat the same words for emphasis and persuasiveness (for example, a Tweet containing “engage, engage, engage” will count as 3x “engage” and not 1x), whereas more succinct Tweets may be under-represented here.

For this section the wordcloud and wordcloud2 packages will be used to create the word clouds.

 

 

 

 

NRC Emotional Sentiment for All Tweets

There are different methods which can be used for sentiment analysis. For this analysis we will explore the Tweets using a predefined lexical dictionary (lexicon) named NRC.

NRC is the Word-Emotion Association Lexicon. It assigns words into one or more of ten categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

We will use the tidytext package and call in the NRC lexicon by using the get_sentiments() function when creating our data frame.

Let’s observe how the NRC lexicon triages the Tweets into the emotional sentiments.
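
The matching itself is a join between the word tokens and the lexicon. The sketch below shows the idea with a tiny stand-in lexicon; the word-to-emotion rows are illustrative assumptions and are not copied from the real NRC lexicon, which the analysis obtains through get_sentiments().

```r
# Stand-in lexicon: illustrative rows only, not real NRC entries.
toy_lexicon <- data.frame(
  word = c("friends", "friends", "sad", "sad", "love", "love"),
  sentiment = c("joy", "positive", "sadness", "negative", "joy", "positive"),
  stringsAsFactors = FALSE
)

# Made-up word tokens, one row per word per Tweet.
tokens <- data.frame(
  status_id = c("t1", "t1", "t1", "t2", "t2"),
  word = c("friends", "sad", "time", "love", "friends"),
  stringsAsFactors = FALSE
)

# Inner join by "word": a word may fall into several categories, and
# words that are missing from the lexicon (here "time") drop out.
matched <- merge(tokens, toy_lexicon, by = "word")

# Count how many word tokens were triaged into each emotional sentiment.
sentiment_counts <- sort(table(matched$sentiment), decreasing = TRUE)
print(sentiment_counts)
```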

 

Wordcloud by NRC Emotional Sentiment for All Tweets

This visualisation paints an “emotional profile” picture for all Tweets in the analysis dataset. This illustrates for us that many Tweets are using words describing “surprise” and “joy” sentiments, even though the common word between all the Tweets is “loneliness”.

 

NRC Emotional Sentiment (Word Table)

The densely populated visualisation below illustrates what words are being triaged into which emotional sentiment grouping.

This can be summarised further to word counts by emotional sentiment group for each analysis group.

 

 

NRC Emotional Sentiment by Group

 

Term Frequency Inverse Document Frequency (TF IDF)

This section focuses on quantifying how important various stemmed words are to the Tweets across the overall analysis dataset.

The Term Frequency - Inverse Document Frequency (TF-IDF for short) is the product of two measures:

Term Frequency * Inverse Document Frequency

  • The Term Frequency (TF): the number of times a word is counted in a document

Multiplied by

  • The Inverse Document Frequency (IDF): a weight that shrinks as the number of documents containing the word (DF) grows. In its simplest form this is 1/DF; in practice it is usually calculated as the natural logarithm of the total number of documents divided by DF

With the TF and IDF combined, a word’s importance is adjusted for how rarely it is used. The assumption with TF-IDF is that words that appear more frequently in a document should be given a higher weighting, unless the word also appears in many documents (Tweets).

For this analysis we can use TF-IDF to identify which words are important to each of the analysis groups in the dataset. We expect the analysis groups to differ in terms of subject, topic, content and sentiment, and we therefore expect the frequency of words to differ between analysis groups. The TF-IDF metric will highlight these differences.
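
As a worked illustration, the base R sketch below computes TF-IDF by hand, treating each analysis group as one document. The tokens and group labels are made up, and the formula choices are assumptions: TF is taken as a word's share of its group's tokens and IDF as the natural logarithm of (number of groups / number of groups containing the word), which is a common convention rather than a confirmed detail of this analysis.

```r
# Made-up stemmed tokens for two hypothetical analysis groups.
tokens <- data.frame(
  analysis_group = c("G1", "G1", "G1", "G2", "G2", "G2"),
  word = c("covid", "covid", "friend", "friend", "marriag", "buddi"),
  stringsAsFactors = FALSE
)

# Count each word within each group and keep only observed combinations.
counts <- as.data.frame(
  table(word = tokens$word, analysis_group = tokens$analysis_group),
  stringsAsFactors = FALSE
)
names(counts)[3] <- "n"
counts <- counts[counts$n > 0, ]

# Term frequency: the word's share of all tokens in its group.
group_totals <- tapply(counts$n, counts$analysis_group, sum)
counts$tf <- counts$n / as.numeric(group_totals[counts$analysis_group])

# Inverse document frequency: ln(number of groups / groups containing the word).
n_groups <- length(unique(counts$analysis_group))
doc_freq <- table(counts$word)
counts$idf <- log(n_groups / as.numeric(doc_freq[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
print(counts[order(-counts$tf_idf), ])
```

In this toy data “friend” appears in both groups, so its IDF and TF-IDF are zero, while “covid” scores highest because it is frequent in one group and absent from the other.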

 

 

 

With respect to the objective of this analysis, we can see that the top words of significance for each analysis group provide some useful, actionable insight into the unique words which describe each group. Here are some notable mentions:

  • The “Ambiguous Individual Loneliness” group contains words describing relationships: marriage, child, buddy and a person’s name, “Jenni”. The token “xxx” could be read as affection or a show of support for another person.
  • The “Ambiguous Social Loneliness” group contains words describing communication about health issues: COVID and FFGALD (Friends for Good Australian Loneliness Dialog).
  • The “Enduring Social Loneliness” group contains words describing the negative responses and impacts of the COVID-19 pandemic and includes mentions of friends and children.

This measure is performing as expected and doesn’t tell us anything new. However, this can still be useful information to be aware of prior to designing and training any models and exploring topic modelling.

 

 

Tweet Sentiment (Positive vs. Negative) by Analysis Groups

Sentiment analysis is a type of text mining which aims to determine the opinion and subjectivity of its content. When applied to Twitter data, the results can be representative of the Tweet created as well as the user’s individual influences.

By using the syuzhet package we can map individual words to positive and negative sentiment scores.

In the boxplots below, the Tweet text sentiment polarity ranges between negative 1 for negative sentiment, and positive 1 for positive sentiment. As a general observation, the median sentiment score across all analysis groups ranges between 0.2 and 0.6, which is slightly positive. It is curious to observe that this range of median sentiments is consistent with the range we observed earlier in the emoji sentiments, which used a different method and algorithm for calculating sentiment scores.


 

 

 

Russell Effect (Valence) by Groups

We can evaluate the emotional expression within the Tweet texts. A rating scale can be applied to individual words to provide a commonly used measure of emotional expression. These scales (between 0 and 1) assume that emotions are multi-dimensional psychological states that can be decomposed into three core dimensions; valence, arousal and dominance. Each word represents a point in a multi-dimensional space that maps out emotion as distinct psychological states.

  • Valence: The positivity of the emotions invoked by a word, going from unhappy to happy. Ranging from zero to one.

  • Arousal: The degree of arousal or energy evoked by a word. Ranging from zero to one.

  • Dominance: The power of the word, the extent to which the word denotes something that is weak/submissive or strong/dominant. Ranging from zero to one.

We can plot each analysis group and Tweet using the emotion in text measures, effectively measuring the “Russell Effect”.
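
A minimal sketch of how each Tweet could be placed on such a plot is shown below. The word-level scores are made up, and splitting the quadrants at 0.5 on both axes is an assumption for illustration only; the actual plot may use different cut-offs.

```r
# Made-up word-level scores; NA marks a word with no lexicon match.
tokens <- data.frame(
  status_id = c("t1", "t1", "t2", "t2", "t3", "t4"),
  Valence = c(0.9, 0.7, 0.2, 0.1, NA, 0.3),
  Arousal = c(0.8, 0.6, 0.3, 0.2, NA, 0.9),
  stringsAsFactors = FALSE
)

# Average valence and arousal per Tweet; rows with NA scores are dropped
# by aggregate()'s default na.action, so "t3" does not appear.
tweet_means <- aggregate(cbind(Valence, Arousal) ~ status_id,
                         data = tokens, FUN = mean)

# Assign a quadrant: valence on the horizontal axis, arousal on the vertical.
# The 0.5 midpoint is an assumed cut-off.
tweet_means$quadrant <- paste(
  ifelse(tweet_means$Arousal >= 0.5, "Top", "Bottom"),
  ifelse(tweet_means$Valence >= 0.5, "right", "left")
)
print(tweet_means)
```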

 

 

Key takeaways from this visualisation

  • Key emotional landmarks in this visual: Top left = frustrated & distressed, Bottom left = ashamed & depressed, Top right = ambitious & excited, Bottom right = relaxed & peaceful.

  • There is a diverse representation of emotion across each analysis group. Each Tweet was averaged using the valence, arousal and dominance scales.

  • In the “Ambiguous Social Loneliness” group there appears to be a larger number of Tweets grouped in the upper area of the bottom right quadrant. This area generally represents emotions such as worry, longing, attentiveness, pensiveness, expectancy and hope.

There is opportunity to explore this section further.

 

Bi-Grams and Tri-Grams

Earlier in this analysis, single word (or unigram) frequency counts were explored. This section is dedicated to exploring what precedes and follows the most common words that have been identified in the collection of Tweets.

There are a few ways to approach n-gram analysis, two options are described below:

  1. Using the tidytext package and unnest_tokens() function to perform a simple and quick “blind” n-gram analysis. For bi-gram and tri-gram analysis, overlapping n-grams share words, so parts of each n-gram are counted more than once, which can skew the results.
  2. Using the tm and RWeka packages to perform a more robust, intelligent n-gram analysis. This is a more complex approach than option 1. It uses a VCorpus and a DocumentTermMatrix. The ID attribute needs to be retained to re-identify the grouping variables, i.e. which n-gram belongs to which Tweet and analysis group, to compare the results.

For this section, Option 2 will be used for the n-gram analysis because it is important to inspect, with greater resolution, the subject matter that the n-grams reveal.
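
To show what an n-gram count involves, here is a hand-rolled base R sketch. It is neither the tidytext nor the tm/RWeka approach described above; the helper make_ngrams() is a hypothetical function written for this illustration and the Tweets are made up. N-grams are built within each Tweet, so no n-gram spans two Tweets.

```r
# Hypothetical helper: all runs of n consecutive words in one Tweet.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up, already cleaned Tweets keyed by an ID.
tweets <- c(t1 = "feel so alone today", t2 = "feel so tired", t3 = "alone")
word_lists <- strsplit(tweets, " ")

# Build bi-grams and tri-grams per Tweet, then pool them for counting.
bigrams  <- unlist(lapply(word_lists, make_ngrams, n = 2), use.names = FALSE)
trigrams <- unlist(lapply(word_lists, make_ngrams, n = 3), use.names = FALSE)

bigram_counts  <- sort(table(bigrams), decreasing = TRUE)
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
print(bigram_counts)
print(trigram_counts)
```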

 

 

 

Just like the NRC emotional sentiment, the bi-grams and tri-grams also begin to paint a picture that each analysis group has its own “ngram frequency profile”.

It is important to note here that the frequency counts are very low for both the bi-grams and tri-grams across all analysis groups. These visualisations show only the top 10 bi-grams and tri-grams for each analysis group; the full lists are very lengthy, with each entry occurring only a few times (high cardinality, low frequency). Perhaps this is an indication of the unique and complex ways in which people write Tweets: everyone has a different communication style, and the order, sequence and delivery of a message differ.

 

Bi-Gram Network

So far we’ve considered words as individual units, and considered their relationships to sentiments and analysis groups. However, many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately, or that tend to co-occur within the same analysis groups.

In this section we will visualise a network of common n-grams discovered in the Tweets.

We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or “graph.” Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes.

A graph can be constructed from a tidy object since it has three variables:

from: the node an edge is coming from

to: the node an edge is going towards

weight: A numeric value associated with each edge

The igraph package has many powerful functions for manipulating and analyzing networks. One way to create an igraph object from tidy data is the graph_from_data_frame() function, which takes a data frame of edges with columns for “from”, “to”, and edge attributes (in this case n).
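
A small sketch of the edge data frame is shown here. The bi-gram counts are made up, and the filter that keeps only edges touching a centring word is an assumption about how the network could be narrowed. The igraph call is guarded so that the example still runs when igraph is not installed.

```r
# Made-up bi-gram counts in "from", "to", "n" form.
bigram_counts <- data.frame(
  from = c("feel", "peopl", "friend", "time", "mental"),
  to   = c("sad", "friend", "famili", "day", "health"),
  n    = c(4, 6, 5, 2, 7),
  stringsAsFactors = FALSE
)

# Keep only the edges where either word is one of the centring words.
centring_words <- c("feel", "peopl", "health", "friend")
edges <- bigram_counts[bigram_counts$from %in% centring_words |
                         bigram_counts$to %in% centring_words, ]
print(edges)

# Build the graph object only if igraph is available.
if (requireNamespace("igraph", quietly = TRUE)) {
  bigram_graph <- igraph::graph_from_data_frame(edges)
  print(bigram_graph)
}
```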

From the earlier inspection of unigram words with high frequencies, the following high-frequency “centering words” have been selected to visualise the Bi-Gram Network:

"feel","peopl","health","friend","anxiety","connect", "home"

 

 

Note that this is a visualisation of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word. In this case, a random generator following this model might spit out “people”, then “friends”, then “mate/buddy/family”, by following each word to the most common words that follow it. To make the visualization interpretable, we chose to show only the most common word to word connections, but one could imagine an enormous graph representing all connections that occur in the text.
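
The Markov chain idea can be sketched in a few lines of base R. The bi-gram counts below are made up, and next_word() is a hypothetical helper written for this illustration: it always follows the single most common next word instead of sampling at random.

```r
# Made-up bi-gram counts: how often "to" follows "from".
bigram_counts <- data.frame(
  from = c("peopl", "peopl", "friend", "friend", "feel"),
  to   = c("friend", "feel", "famili", "mate", "sad"),
  n    = c(6, 2, 5, 3, 4),
  stringsAsFactors = FALSE
)

# Transition matrix: each row gives the probability of the next word.
transition_counts <- xtabs(n ~ from + to, data = bigram_counts)
transition_probs <- prop.table(transition_counts, margin = 1)

# Hypothetical helper: the most likely word to follow `word`, or NA if
# `word` never appears as the first word of a bi-gram.
next_word <- function(word) {
  if (!word %in% rownames(transition_probs)) return(NA_character_)
  names(which.max(transition_probs[word, ]))
}

# Walk the chain from a starting word for at most five words.
chain <- "peopl"
repeat {
  step <- next_word(chain[length(chain)])
  if (is.na(step) || length(chain) >= 5) break
  chain <- c(chain, step)
}
print(chain)
```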

 

Exploratory Graph Analysis (EGA) Network

Exploratory Graph Analysis (EGA) is a form of network analysis which creates word clusters based on correlation. Words which are most correlated with one another will appear in the same cluster (also known as a dimension). It is then up to the researcher and analyst to interpret the results.

To set up the EGA, the text analysis dataset is first transformed into a corpus using the tm package. From there, the SentimentAnalysis package is used and the function analyzeSentiment() is applied to extract the sentiment scoring for each Tweet. The results from analyzeSentiment() contain 13 sentiment scores from four different dictionaries: GI, HE, LM, and QDAP. Running the analyzeSentiment() function compares the text of the Tweets with these various dictionaries, looking for word matches. When there are matches with positively or negatively categorized words, the Tweet is given a corresponding sentiment score, which is located in the output data frame.

Next, the output dataframe is reduced to contain only the overall sentiment score from each dictionary. That means removing the accompanying “Negativity” and “Positivity” measures, leaving only the measures which will be used to cluster the words within Tweets. Once this is done, the data we are working with is smaller and easier to understand.

Having no theoretical reason to rely on any of these dictionaries more than the others, we can now create a mean value for each Tweet’s sentiment level, resulting in a single sentiment value for each Tweet. A new variable is created which takes the average of every row in the dataframe.

To perform the EGA, the original tutorial loads the devtools package and installs EGA using devtools::install_github("hfgolino/EGA"); the code below loads the package under the name EGAnet. The package was created by Hudson Golino and can be sourced from his GitHub account.

Having installed and loaded the required packages, we now use the command EGA() to initially generate the clusters, using the prepared dataframe as input. The results of this will generate a visual indicating the words in each network node, with the colours demarcating clusters.

 

#==================== Exploratory Graph ANALYSIS (EGA) ===================
#https://uvastatlab.github.io/2019/05/03/an-introduction-to-analyzing-twitter-data-with-r/
library(dplyr) #provides %>% and select()
EGA_TextNetworkAnalysis <- as.data.frame(wrk_100_text_analysis) %>%
  dplyr::select(status_id, text_clean) 

library(tm)
#Build a corpus from the cleaned Tweet text, then normalise it step by step
text_corpus <- Corpus(VectorSource(wrk_100_text_analysis$text_clean))
#content_transformer() wraps base functions so the corpus structure is preserved
text_corpus <- tm_map(text_corpus, content_transformer(tolower))
#Remove profanity, the search terms and pronouns that add noise to the clusters
text_corpus <- tm_map(text_corpus, removeWords, c("shit","fuck","ive","im", "lonely","loneliness","i","me","my","his","her","they","u","myself","your"))
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
text_corpus <- tm_map(text_corpus, removePunctuation)
text_corpus <- tm_map(text_corpus, stemDocument)
#Pull the processed text back out of the corpus, one row per Tweet
text_df <- data.frame(text_clean_corpus = get("content", text_corpus), stringsAsFactors = FALSE)
EGA_TextNetworkAnalysis <- cbind.data.frame(EGA_TextNetworkAnalysis, text_df)

library(SentimentAnalysis)
sentiment <- analyzeSentiment(EGA_TextNetworkAnalysis$text_clean_corpus)

sentiment <- dplyr::select(sentiment, 
                           SentimentGI, SentimentHE,
                           SentimentLM, SentimentQDAP, 
                           WordCount)

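#Average only the four dictionary scores; WordCount must not be part of the mean
sentiment <- dplyr::mutate(sentiment, mean_sentiment = rowMeans(dplyr::across(dplyr::starts_with("Sentiment"))))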

sentiment <- dplyr::select(sentiment, WordCount, mean_sentiment)

wrk_100_text_analysis_EDA <- cbind.data.frame(EGA_TextNetworkAnalysis, sentiment)

#nrow(wrk_100_text_analysis_EDA)
#1063

library(quanteda)
tokenized_list <- tokens(wrk_100_text_analysis_EDA$text_clean_corpus)

Tweet_dfm <- dfm(tokenized_list)
word_sums <- colSums(Tweet_dfm)
#length(word_sums)
#2190

freq_data <- data.frame(word = names(word_sums), 
                        freq = word_sums, 
                        row.names = NULL,
                        stringsAsFactors = FALSE)

sorted_freq_data <- freq_data[order(freq_data$freq, decreasing = TRUE), ]

corpus_tm <- Corpus(VectorSource(wrk_100_text_analysis_EDA$text_clean_corpus)) #col 3 is text_clean_corpus

Tweet_dtm <- DocumentTermMatrix(corpus_tm)

Tweet_dtm <- removeSparseTerms(Tweet_dtm, 0.98)

df_cluster <- as.data.frame(as.matrix(Tweet_dtm))

#Now, we will create clusters using a technique called 
#Exploratory Graph Analysis (EGA, for short). 
#Install and load the package devtools and EGA.
#EGA is a form of network analysis which creates word 
#clusters based on correlation. Words which are most correlated 
#with one another will appear in the same cluster (also called a dimension)
#and it is up to the user to interpret these results. 
#This package was created by Hudson Golino and can be sourced 
#from his GitHub account, as displayed below

#library(devtools)
#devtools::install_github("hfgolino/EGA")
#The package is loaded here under the name EGAnet
library(EGAnet)

ega_cluster <- EGA(df_cluster)

#ega_cluster$dim.variables #WHat words belong to clusters?

#ega_summary <- as.data.frame(ega_cluster$dim.variables)

#ega_summary %>%
#  datatable(class = "cell-border stripe", filter = 'top', caption = 'Exploratory Graph Analysis - Clusters and Words',
#            rowname = FALSE, options = list(autoWidth = TRUE, searching = TRUE, pageLength = 10)) 


#ega_summary_correlation <- as.data.frame(ega_cluster$correlation)

#ega_summary_correlation %>%
#  datatable(class = "cell-border stripe", filter = 'top', caption = 'Exploratory Graph Analysis - Correlations',
#            rowname = FALSE, options = list(autoWidth = TRUE, searching = TRUE, pageLength = 10)) 

#ega_summary_network <- as.data.frame(ega_cluster$network)

#ega_summary_network %>%
#  datatable(class = "cell-border stripe", filter = 'top', caption = 'Exploratory Graph Analysis - Network',
#            rowname = FALSE, options = list(autoWidth = TRUE, searching = TRUE, pageLength = 10)) 

 

 

While this 8-cluster graphic is visually appealing, it often abbreviates the words and so is less informative than it could be. We therefore select dim.variables from the EGA clustering object to see which words belong to which cluster. We have to construct our own meaning from these clusters.

For example, clusters 1, 2 and 6 contain the most words. Cluster 1 contains many words which could be read as negative expressions and consequences related to the COVID-19 pandemic, while Cluster 2 contains words suggesting temporal expressions of emotion and experience. Cluster 6 contains words related to loneliness awareness advocacy, such as the Friends for Good Australian Loneliness Dialog.

The results of this EGA give a better sense of which expressions of loneliness exist and which common groups begin to emerge, beyond looking only at the frequencies and assigned analysis groups.
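As a minimal sketch of how a cluster-to-word listing can be read, the base R snippet below groups words by cluster and counts the words per cluster. The data frame ega_summary_toy and its words are hypothetical. It assumes the dim.variables object has the columns items and dimension, which may differ between EGAnet versions.

```r
# Hypothetical stand-in for ega_cluster$dim.variables
# (assumed columns: items = word, dimension = cluster number)
ega_summary_toy <- data.frame(
  items = c("covid", "lockdown", "isolation", "night", "today",
            "awareness", "friends"),
  dimension = c(1, 1, 1, 2, 2, 6, 6),
  stringsAsFactors = FALSE
)

# List the words that belong to each cluster
words_by_cluster <- split(ega_summary_toy$items, ega_summary_toy$dimension)

# Count the words per cluster and show the largest clusters first
cluster_sizes <- lengths(words_by_cluster)
sort(cluster_sizes, decreasing = TRUE)
```

Listing the words per cluster in this way makes it easier to assign an interpretation to each cluster than reading abbreviated labels from the network graphic.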

 

EGA Clusters and Included Words Definition

ega_summary <- as.data.frame(ega_cluster$dim.variables)
ega_summary %>%
  DT::datatable(class = "cell-border stripe", filter = 'top', caption = 'Exploratory Graph Analysis - Clusters and Words',
            rownames = TRUE, options = list(autoWidth = TRUE, searching = TRUE, pageLength = 10)) 

 

Tweet Similarity (Trans-Venn Diagram)

The qdap package provides a function called trans_venn(). It gives a view of the similarity between Tweets and their respective analysis groups, visualised as a Venn diagram.

 

This visualisation is arguably more of a novelty than a solid metric for meeting the objectives of this analysis. It does, however, illustrate that there is a strong overlap in similarity between the analysis groups and Tweets.

Reading the Venn diagram, it appears that:

  • Approximately half of the “Ambiguous Social Loneliness” group overlaps and intersects with all other analysis groups, while a significant portion does not intersect any other group. This group therefore shares many similarities with the others while also highlighting differences in Tweet content.
  • The “Enduring Social Loneliness” group appears to behave in a similar way to “Ambiguous Social Loneliness”, except that it spans in the opposite direction.
  • Purely based on visual similarity, “Transient Social Loneliness”, “Ambiguous Individual Loneliness”, “Transient Individual Loneliness” and “Enduring Individual Loneliness” appear to be subsets of “Enduring Social Loneliness” and “Ambiguous Social Loneliness”.

 

Similarity dendrogram

Another way to measure and illustrate similarity between Tweets and analysis groups is the Dissimilarity() function from the qdap package, which uses a distance function to calculate dissimilarity statistics by analysis group.

The Dissimilarity() function returns a matrix of dissimilarity values, which represents the agreement between Tweet text. We will plot this matrix as a dendrogram and identify some potential clusters.

In this section we apply each of the analysis grouping variables created in earlier sections of this analysis and observe the similarities between groups.
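The idea of turning text into a dissimilarity matrix and then a dendrogram can be sketched in base R. This is a simplified stand-in and not the qdap implementation. It assumes a hypothetical toy data frame tweets, and uses stats::dist() with the "binary" method on a group-by-word presence matrix, followed by hclust().

```r
# Hypothetical toy Tweets with analysis-group labels
tweets <- data.frame(
  group = c("Transient", "Transient", "Enduring", "Enduring", "Ambiguous"),
  text = c("lonely tonight", "feeling lonely today",
           "always lonely nobody cares", "lonely for years nobody calls",
           "lonely"),
  stringsAsFactors = FALSE
)

# Pool the distinct words used by each group
words_by_group <- lapply(split(tweets$text, tweets$group),
                         function(x) unique(unlist(strsplit(x, " "))))
vocab <- sort(unique(unlist(words_by_group)))

# Group-by-word presence/absence matrix (rows = groups, columns = words)
presence <- t(sapply(words_by_group, function(w) as.integer(vocab %in% w)))
colnames(presence) <- vocab

# Binary distance between groups, then hierarchical clustering
d <- dist(presence, method = "binary")
hc <- hclust(d)

# Cut the tree into two clusters
clusters <- cutree(hc, k = 2)
clusters
```

Calling plot(hc) draws the dendrogram, and rect.hclust(hc, k = 2) adds boxes around the two clusters.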

 

 

Dendrogram for the Primary Analysis Groups

Dendrogram for the Secondary Analysis Groups (Loneliness Contexts)

Dendrogram for the Tertiary Analysis Groups (Extended Contexts)

Dendrogram for the Combined Analysis Groups (Primary + Secondary)

The red, blue and green boxes indicate the cluster boundaries for the groups represented in each dendrogram.

The bottom dendrogram, showing the combined analysis groups, is curious: groups with distinctly different names fall within the same clusters. We can view these distinctive combined analysis groups as “headlines” for people experiencing loneliness, and the content within these “headlines” may be much more similar than the headlines themselves suggest at face value.

 

8. Automated Data Exploration

Unsupervised Machine Learning

This section is dedicated to making further sense of the unstructured nature of text data, or in this case, Tweets. In previous sections we utilised some clustering methods to automatically find groups within the Tweet text data. Here, we can use unsupervised machine learning to automatically explore the text data to help reveal underlying patterns and groups within the Tweets, beyond what we have explored in previous sections.

Topic modelling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for or what our target is.

Some questions for guiding this section are:

  • Can we identify any meaningful groups or themes in our collection of Tweets and analysis groups?
  • Which words contribute to which topics?
  • Which topics contribute to which analysis groups?

For this piece, we will use Latent Dirichlet Allocation (LDA).

The packages quanteda, topicmodels and broom will be used.

This section offers a different viewpoint on topics and themes from that of the sentiment, EGA and similarity/dissimilarity analyses.

 

Latent Dirichlet Allocation (LDA)

The topicmodels package will be used to create an LDA model.

How many topics will we tell the algorithm to make? This is a question much like in k-means clustering; we don’t really know ahead of time.

We can try a few different values and see how well the model fits the text. Let’s start with 10 topics.

This is a stochastic algorithm that could have different results depending on where the algorithm starts, so a seed value is needed for reproducibility.

## A LDA_VEM topic model with 10 topics.
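The call that produced the model summary above is not shown in this section. As a minimal sketch of how an LDA model can be fitted with a fixed seed using the tm and topicmodels packages, consider the following. The toy Tweets, the object names (toy_tweets, toy_dtm, toy_lda) and the seed value 1234 are illustrative assumptions, and 2 topics are used here only because the toy corpus is tiny.

```r
library(tm)
library(topicmodels)

# Hypothetical toy Tweets; the real input would be the cleaned Tweet text
toy_tweets <- c(
  "feeling lonely tonight miss friends",
  "lockdown makes people feel lonely",
  "loneliness awareness week support community",
  "miss friends family during lockdown",
  "support people feeling lonely community",
  "lonely night sad tired"
)

# Build a document-term matrix of raw term counts
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(toy_tweets)))

# Fit the model twice with the same seed to illustrate reproducibility
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))
toy_lda_repeat <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Top 3 terms for each topic
terms(toy_lda, 3)

# Per-topic term probabilities (beta) and per-document topic probabilities (gamma)
beta <- posterior(toy_lda)$terms
gamma <- posterior(toy_lda)$topics
```

Each row of beta sums to 1 across terms, and each row of gamma sums to 1 across topics. Fitting with the same seed should return the same estimates.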

 

With the LDA model object created, it can now be explored. We use the tidy() function to explore the content of the model object; broom supplies the generic and tidytext supplies the method for LDA models.

 

 

In the table below, column β (beta) tells us the probability of that term being generated from that topic. Notice that some values are very low and some are not so low. This information helps us to consider which parameters to use when instructing the algorithm to find topic groups within the Tweet text data.

Next, we’ll inspect the top 10 terms in each LDA topic.

 

top_terms <- tidy_lda %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
library(ggplot2)
library(ggstance)
library(ggthemes)
ggplot(top_terms, aes(beta, term, fill = as.factor(topic))) +
  geom_barh(stat = "identity", show.legend = FALSE, alpha = 0.8) +
  labs(title = "Top 10 Terms in Each LDA Topic",
       subtitle = "Topic modeling of Tweets about Loneliness",
       y = NULL, x = "beta") +
  facet_wrap(~topic, ncol = 4, scales = "free") +
  ggthemes::theme_tufte(base_family = "Arial", base_size = 13, ticks = FALSE) +
  scale_x_continuous(expand=c(0,0)) +
  theme(strip.text=element_text(hjust=0)) +
  theme(plot.caption=element_text(size=9))

 

 

Further exploration is needed to find the right number of topics and to improve the model. Note that the beta scores are very low for many of the terms in some of the topics shown in the graph.
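One possible way to compare candidate numbers of topics is to fit a model for each value and compare their perplexity, where lower values indicate a better fit. The sketch below assumes hypothetical toy Tweets and the candidate values 2, 3 and 4. It is illustrative only and is not the procedure used in this analysis.

```r
library(tm)
library(topicmodels)

# Hypothetical toy Tweets; the real input would be the cleaned Tweet text
toy_tweets <- c(
  "feeling lonely tonight miss friends",
  "lockdown makes people feel lonely",
  "loneliness awareness week support community",
  "miss friends family during lockdown",
  "support people feeling lonely community",
  "lonely night sad tired",
  "community support reduces loneliness",
  "tired sad lonely during lockdown"
)
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(toy_tweets)))

# Fit one model per candidate number of topics and record its perplexity
candidate_k <- c(2, 3, 4)
fit_stats <- sapply(candidate_k, function(k) {
  fit <- LDA(toy_dtm, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = toy_dtm)
})

data.frame(k = candidate_k, perplexity = fit_stats)
```

Perplexity computed on the same data used for fitting tends to favour larger models, so scoring held-out Tweets would give a fairer comparison.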

Next, we will inspect which topics are associated with which documents.

 

lda_gamma <- tidy(desc_lda, matrix = "gamma")
lda_gamma

 

The column γ (gamma) here is the probability that each document belongs in each topic. The closer to 1 the score is, the more the document belongs exclusively to the specific topic.

Notice that some are very low and some are higher. Observe how the probabilities are distributed here.

 

ggplot(lda_gamma, aes(gamma, fill = as.factor(topic))) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~topic, ncol = 4) +
  scale_y_log10() +
  labs(title = "Distribution of Probability for Each Topic",
       subtitle = "Topic modeling of Tweets about Loneliness",
       y = NULL, x = "gamma") +
  theme_minimal(base_family = "Arial", base_size = 13) +
  theme(strip.text=element_text(hjust=0)) +
  theme(plot.caption=element_text(size=9))

 

The y-axis is plotted on a log scale so that the distribution is easier to see.

Most documents are sorted into one of these topics with fair probability. However, many documents appear to be shared amongst several topics, which makes it less likely that a document is allocated to a single topic. This may reflect the social and communicative nature of Tweets about loneliness, which include many facets and connect with many themes and topics.
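To make the idea of documents being shared amongst topics concrete, the base R sketch below takes a hypothetical matrix of gamma values (rows are documents, columns are topics). For each document it reports the dominant topic, its probability, and how many topics exceed a threshold of 0.2. The matrix values and the threshold are illustrative assumptions.

```r
# Hypothetical document-topic probabilities (gamma); each row sums to 1
gamma <- matrix(
  c(0.90, 0.05, 0.05,
    0.40, 0.35, 0.25,
    0.34, 0.33, 0.33,
    0.10, 0.80, 0.10),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), paste0("topic", 1:3))
)

# Topic with the highest probability for each document
dominant_topic <- apply(gamma, 1, which.max)

# Probability of that dominant topic
max_gamma <- apply(gamma, 1, max)

# Number of topics each document belongs to with probability above 0.2
topics_above_threshold <- rowSums(gamma > 0.2)

data.frame(dominant_topic, max_gamma, topics_above_threshold)
```

Documents such as doc2 and doc3 have no clearly dominant topic and pass the threshold for all three topics, which is the "shared amongst several topics" situation described above.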

 

LDA_word_count_format <- LDA_word_counts %>%
  mutate(doc_id = as.character(doc_id))

lda_gamma <- full_join(lda_gamma, LDA_word_count_format, by = c("document" = "doc_id"))
lda_gamma

 

Let’s keep each document that was modelled as belonging to a topic with a probability of more than 0.2, and then find the top keywords for each topic where the count for those keywords is at least 15.

 

top_keywords <- lda_gamma %>% 
  filter(gamma > 0.2) %>% 
  group_by(topic, word) %>% 
  count(word, sort = TRUE) %>%
  filter(n >= 15)
top_keywords

 

top_keywords <- top_keywords %>%
  top_n(2, n)
ggplot(top_keywords, aes(n, word, fill = as.factor(topic))) +
  geom_barh(stat = "identity", show.legend = FALSE, alpha = 0.8) +
  labs(title = "Top keywords for Each LDA Topic",
       subtitle = "Topic modeling of Tweets about Loneliness",
       y = NULL, x = "Number of Tweets") +
  facet_wrap(~topic, ncol = 4, scales = "free") +
  scale_x_continuous(expand=c(0,0)) +
  theme(strip.text=element_text(hjust=0)) +
  theme(plot.caption=element_text(size=9))

 

These are interesting combinations of keywords. More testing and iteration are required to explore the robustness and fitness for purpose of this model. There are some distinctive words representing each topic, while some words are shared across several topics, such as “feel”, “people” and “sad”.

 

9. Findings and Opportunities

This analysis provided a comprehensive inspection of Tweets containing expressions of loneliness.

Our achievements from this analysis were:

  • Creating a point-in-time data capture and view of the number of people in the Geelong region using social media, specifically Twitter, to communicate expressions of loneliness.
  • Successfully applying international social science research to quantify and measure expressions of loneliness on social media.
  • Highlighting some useful methods for including complex text data such as emojis and emoticons in the assessment of communicated sentiments.
  • Highlighting several methods for manually and automatically discovering hidden patterns, clusters and groups within the Tweet text data.

 

We learned from this analysis:

  • Some aspects of the expressed loneliness context were beyond the scope of the applied research. These relate to the “advocates” of creating loneliness awareness, the people who use loneliness as a “projectile”, and the impacts and consequences of the COVID-19 pandemic.
  • With careful technical consideration, complex text data such as Unicode emojis and emoticons can be incorporated to add another dimension and perspective to text analysis.
  • Although this is not always feasible with large volumes of data, the small dataset used in this analysis allowed us to manually observe and scrutinise each record and apply categorical and descriptive labels to enhance the analysis.
  • We do not have evidence in this study that pertains to recovery, or evidence that people’s expressions of loneliness persist or decay over time.
  • We explored explicit expressions of loneliness. Because of the nature of the sampling, we did not include expressions of loneliness that were less straightforward than the phrases queried.
  • Expressions of loneliness vary greatly between people; how one person phrases the experience is very rarely the same as the next person. These are observations, and we cannot claim to have described the actual experience of loneliness. We may also be missing nuances in communication practices around loneliness that can be introduced by cultural variation.
  • We explored communicated disclosures, which may not always imply genuine experiences of loneliness.

 

We observed further opportunities in these areas:

  • Explore the options for creating a #hashtag to raise awareness of loneliness in Geelong, and aim to drive its trending on social media.
  • More thoroughly inspect the analysis grouping variables created for this analysis and determine how effectively they describe expressions of loneliness.
  • Explore the key words associated with selecting emojis and emoticons on social media platforms to punctuate expressions of loneliness.
  • Expand similar analyses to other, more popular social media platforms such as Facebook, LinkedIn and Instagram.
  • Given the low sample size of this analysis, more data is needed to form more robust conclusions. The analysis data presented here can augment and enhance existing social and health science data to strengthen supports for improving the loneliness situation of the Geelong region and nationwide.
  • The emotion and feeling triggers for loneliness could be explored further via observations of linked multimedia content on social media, such as books, songs, videos, memes, poetry and artwork, and by measuring the evoked emotional responses from users in the social network.
  • There may be merit in exploring social influence. We can identify users with high numbers of followers and observe patterns and behaviours in tweet content and follower responses. What would happen if a high profile social influencer changed their views on loneliness?
  • The short time frame of this analysis meant that a fully fledged longitudinal study was not possible. This analysis observed nearly 1 loneliness tweet to 1 user over a 6 week time frame. We cannot ascertain the historic and post-analysis loneliness expression trajectory or loneliness status for these people. What would happen if data were captured over a longer timeline?
  • The data prepared in this analysis could be used in machine learning applications to classify and label more collected tweets containing mentions of loneliness.

 

10. Next Steps

Present the findings from this analysis to the project team, sponsor and stakeholders.

 

11. References

[1][Ruiz, C, et al (2017). Loneliness in a Connected World: Analyzing Online Activity and Expressions on Real Life Relationships of Lonely Users. The AAAI 2017 Spring Symposium on Wellbeing AI: From Machine Learning to Subjectivity Oriented Computing (online). Technical Report SS-17-08 (viewed 6 September 2020)](https://aaai.org/Library/Symposia/Spring/ss17-08.php)

[2][Kivran-Swaine, F, et al (2014). Understanding Loneliness in Social Awareness Streams: Expressions and Responses (viewed 6 September 2020)](http://www.jedbrubaker.com/wp-content/uploads/2008/05/KivranSwaine-ICWSM-LonelyTweets.pdf)

[3][Trafford Data Lab (2020). Exploring Tweets in R. (Viewed 6 September 2020)](https://medium.com/@traffordDataLab/exploring-Tweets-in-r-54f6011a193d)

[4][Desai, M (2018). Sentiment Analysis using R and Twitter. (Viewed 6 September 2020)](https://www.tabvizexplorer.com/sentiment-analysis-using-r-and-twitter/)

[5][University of Virginia (2019). An Introduction to Analyzing Twitter Data with R. (Viewed 6 September 2020)](https://uvastatlab.github.io/2019/05/03/an-introduction-to-analyzing-twitter-data-with-r/)

[6][Unicode. Emoji List Version 13.0. (Viewed 6 September 2020)](http://unicode.org/emoji/charts/emoji-list.html)

[7][Emoji graphical icon repository](https://twemoji.maxcdn.com/2/72x72/)

[8][Kralj Novak, P et al (2015). Emoji Sentiment Ranking 1.0. (Viewed 6 September 2020)](https://www.clarin.si/repository/xmlui/handle/11356/1048)

[9][Emoji Sentiment Ranking v1.0 Dataset (Viewed 6 September 2020)](http://kt.ijs.si/data/Emoji_sentiment_ranking/index.html)

[10][Özyıldırım, E (2019). Subjective Value Assessment Based on Emojis for Applications in Landscape and Urban Planning. (Viewed 6 September 2020)](https://cartographymaster.eu/wp-content/theses/2019_Ozyildirim_Thesis.pdf)

[11][Quan, C (2017). The Data Files: Twitter Emoji Analysis. (Viewed 6 September 2020)](https://code.likeagirl.io/the-data-files-twitter-emoji-analysis-7703ea2a6238)

[12][KÜHNE, D.C.S (2019). Collecting and Analyzing Twitter Data Using R. (Viewed 6 September 2020)](https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/collecting-and-analyzing-twitter-using-r/)

[13][Concept: Latent Dirichlet Allocation](https://rpubs.com/juliasilge/201707)

[14][Concept: Applying the Russell Effect & Emotion Dynamics](https://www.r-bloggers.com/a-model-and-simulation-of-emotion-dynamics/)

[15][Concept: SentimentR Package used to calculate sentiments in this analysis](https://cran.r-project.org/web/packages/sentimentr/readme/README.html)

 

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggthemes_4.2.0          broom_0.5.6             topicmodels_0.2-11     
##  [4] EGAnet_0.9.7            quanteda_2.1.1          SentimentAnalysis_1.3-3
##  [7] sentimentr_2.7.1        koRpus_0.11-5           sylly_0.1-5            
## [10] tm_0.7-7                NLP_0.2-0               syuzhet_1.0.4          
## [13] wordcloud_2.6           RColorBrewer_1.1-2      wordcloud2_0.2.1       
## [16] ggstance_0.3.4          ggimage_0.2.8           tidytext_0.2.4         
## [19] formattable_0.2.0.1     kableExtra_1.1.0        lubridate_1.7.9        
## [22] leaflet.providers_1.9.0 leaflet_2.0.3           data.table_1.12.8      
## [25] DT_0.14                 rprojroot_1.3-2         forcats_0.5.0          
## [28] stringr_1.4.0           purrr_0.3.4             readr_1.3.1            
## [31] tidyr_1.1.0             tibble_3.0.1            ggplot2_3.3.2          
## [34] tidyverse_1.3.0         knitr_1.29              dplyr_1.0.0            
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1           backports_1.1.8        Hmisc_4.4-1           
##   [4] BDgraph_2.63           fastmatch_1.1-0        plyr_1.8.6            
##   [7] igraph_1.2.5           splines_4.0.2          crosstalk_1.1.0.1     
##  [10] SnowballC_0.7.0        usethis_1.6.1          digest_0.6.25         
##  [13] foreach_1.5.0          htmltools_0.5.0        magick_2.4.0          
##  [16] fansi_0.4.1            magrittr_1.5           checkmate_2.0.0       
##  [19] cluster_2.1.0          modelr_0.1.8           RcppParallel_5.0.2    
##  [22] jpeg_0.1-8.1           qdapDictionaries_1.0.7 colorspace_1.4-1      
##  [25] blob_1.2.1             rvest_0.3.5            haven_2.3.1           
##  [28] xfun_0.15              crayon_1.3.4           jsonlite_1.7.0        
##  [31] iterators_1.0.12       survival_3.1-12        glue_1.4.1            
##  [34] stopwords_2.0          gtable_0.3.0           NetworkToolbox_1.4.0  
##  [37] webshot_0.5.2          abind_1.4-5            scales_1.1.1          
##  [40] mvtnorm_1.1-1          DBI_1.1.0              qdapRegex_0.7.2       
##  [43] Rcpp_1.0.5             viridisLite_0.3.0      htmlTable_2.0.1       
##  [46] tmvnsim_1.0-2          gridGraphics_0.5-0     foreign_0.8-80        
##  [49] Formula_1.2-3          textclean_0.9.3        stats4_4.0.2          
##  [52] htmlwidgets_1.5.1      httr_1.4.1             lavaan_0.6-7          
##  [55] modeltools_0.2-23      ellipsis_0.3.1         pkgconfig_2.0.3       
##  [58] farver_2.0.3           nnet_7.3-14            dbplyr_1.4.4          
##  [61] utf8_1.1.4             labeling_0.3           ggplotify_0.0.5       
##  [64] tidyselect_1.1.0       rlang_0.4.7            reshape2_1.4.4        
##  [67] munsell_0.5.0          cellranger_1.1.0       tools_4.0.2           
##  [70] cli_2.0.2              generics_0.0.2         fdrtool_1.2.15        
##  [73] evaluate_0.14          yaml_2.2.1             fs_1.4.2              
##  [76] glasso_1.11            pbapply_1.4-3          nlme_3.1-148          
##  [79] whisker_0.4            mime_0.9               slam_0.1-47           
##  [82] xml2_1.3.2             tokenizers_0.2.1       compiler_4.0.2        
##  [85] rstudioapi_0.11        png_0.1-7              huge_1.3.4.1          
##  [88] reprex_0.3.0           pbivnorm_0.6.0         stringi_1.4.6         
##  [91] qgraph_1.6.5           lattice_0.20-41        Matrix_1.2-18         
##  [94] psych_2.0.7            markdown_1.1           vctrs_0.3.1           
##  [97] pillar_1.4.4           lifecycle_0.2.0        BiocManager_1.30.10   
## [100] corpcor_1.6.9          R6_2.4.1               latticeExtra_0.6-29   
## [103] gridExtra_2.3          janeaustenr_0.1.5      lexicon_1.2.1         
## [106] codetools_0.2-16       gtools_3.8.2           MASS_7.3-51.6         
## [109] assertthat_0.2.1       rjson_0.2.20           withr_2.2.0           
## [112] mnormt_2.0.1           parallel_4.0.2         hms_0.5.3             
## [115] grid_4.0.2             rpart_4.1-15           rmarkdown_2.3         
## [118] rvcheck_0.1.8          d3Network_0.5.2.1      base64enc_0.1-3